--- license: apache-2.0 library_name: pytorch tags: - modular-arithmetic - algorithmic-reasoning - rnn - number-theory - neural-algorithm --- # Horner-RNN — learned modular multiplication up to 2¹²⁸ A compliant **bit-sequential RNN** that computes `(a · b) mod p` for primes `p` up to **2¹²⁸**, by *learning the Horner step of double-and-add* rather than memorising multiplication tables. Entry for the [Modular Arithmetic Challenge](https://github.com/SAIRcompetition/modular-arithmetic-challenge). - **Saturates tiers 1–6** (all primes `< 2¹²⁸`): tiers 1–3 = 100%, tier 4 = 99%, tier 5 = 98%, **tier 6 = 97%** - **overall_accuracy 0.602**, `highest_tier_above_90 = 6` - The 128-bit (tier-6) cell is a **carry-aware TCN** (weight-shared dilated convolutions over the 128 bit-positions, ~3.9M params) — a far better inductive bias for long carry chains than the MLP - Verifiably **generalises to primes never seen in training** (held-out-prime validation accuracy tracks training accuracy — no memorisation gap) ## The idea Write `a` in bits, MSB-first; then `a·b mod p` is the iterate of one small map: ``` t_0 = 0 t_{k+1} = (2·t_k + a_bit_k · b) mod p # one learned step (Horner) answer = t_N (N = bit width of p) ``` The model is an RNN whose **transition function — an MLP — is trained on exactly that single-step map** over binary-encoded inputs. The hidden state is a quantized bit vector (a hard binary bottleneck), so the recurrence composes cleanly: if the cell is exact per step, the chain is exact end-to-end. At inference the scan feeds the bits of `a mod p` one per step, conditioned on `(b mod p, p)`, and the final hidden-state bits are emitted MSB-first as the base-2 answer (`output_base: 2`). The single-step function is **piecewise linear** (`2t + bit·b`, then subtract `0`, `p`, or `2p`), which is why it generalises across primes where the full bilinear map `(a,b) → a·b mod p` does not. ## Files / cells The model ships **four cells** and routes each problem to the narrowest one whose state holds the prime: | File | Cell | Primes | Tiers | Arch | Params | Public benchmark | |---|---|---|---|---|---|---| | `weights16.pt` | 16-bit | `< 2¹⁶` | 1–3 | MLP, 4096 / 4 | ~50M | tiers 1–3 = 1.00 | | `weights32.pt` | 32-bit | `< 2³²` | 4 | MLP, 6144 / 4 | ~114M | tier 4 = 0.99 | | `weights64.pt` | 64-bit | `< 2⁶⁴` | 5 | MLP, 4096 / 7, residual | ~236M | tier 5 = 0.98 | | `weights128.pt` | 128-bit | `< 2¹²⁸` | 6 | **carry-aware TCN**, 256ch / 10 blocks, dilations 1–64 | ~3.9M | **tier 6 = 0.97** | The 128-bit cell switches architecture: instead of a full-width MLP it is a **non-causal dilated 1-D convolutional network over the 128 bit-positions**. Carry propagation is *position-invariant* — the same carry/borrow rule applies at every bit — so a weight-shared convolution learns **one** rule applied everywhere (non-causal, so the addition carry flows LSB→MSB and the mod-`p` compare/borrow flows MSB→LSB), rather than an MLP learning 128 separate position-functions. This inductive bias drives the per-step error roughly **15× lower** than the same-task MLP, lifting tier 6 from 0.26 to **0.97** with a cell **~60× smaller** (16 MB vs ~950 MB). The 64-bit cell needs **depth and residual connections** the narrower cells do not: a 64-bit modular Horner step hides two long carry chains (the `2t + bit·b` addition and the compare-and-subtract reduction), and exact n-bit carry propagation wants MLP depth ~log₂(n). The last push from tier 5 = 0.74 to 0.98 came from training the 64-bit cell's single-step examples on the **states the chain actually visits** (the true Horner trajectory) rather than uniformly sampled `t` — see *Training*. For `p ≥ 2⁶⁴` the model emits the honest `[0]` fallback without invoking the network. Also in the repo: `model.py` (the `HornerRNN` entry class + `HornerCell`), `manifest.json` (challenge manifest), `train.py` (the 16-bit trainer). ## Usage This is a challenge submission; the base class lives in the challenge package, so install it first: ```bash pip install "git+https://github.com/SAIRcompetition/modular-arithmetic-challenge" ``` Direct inference: ```python import torch from model import HornerRNN # model.py from this repo m = HornerRNN() m.load(".") # auto-loads weights{16,32,64}.pt from this dir # returns base-2 digits, MSB-first; the harness decodes them to the integer digits = m.predict_digits_batch([(123456789, 987654321, 4294967291)])[0] answer = int("".join(map(str, digits)), 2) print(answer) # == (123456789 * 987654321) % 4294967291 ``` Or score it with the official harness: ```bash modchallenge evaluate . --total 1100 ``` ## Compliance (the rules permit *learned* algorithms, not hand-coded ones) The **scan** (tokenise `a mod p` into bits, iterate, read out the final state) is architecture — it computes nothing by itself. The **arithmetic** (doubling, conditional add, compare-against-`p`, carries) all lives in the trained cell weights; nothing in the code adds, multiplies, or compares against `p`. **Principle 2, measured** — perturbing the cell weights with Gaussian noise scaled to each tensor's std collapses accuracy toward the floor, and a fully re-initialised (untrained) cell is *at* the floor. The capability therefore resides in the trained parameters: | noise σ (×param std) | 0 | 0.05 | 0.1 | 0.25 | 0.5 | untrained | |---|---|---|---|---|---|---| | tier 3 (16-bit cell) | 1.00 | 1.00 | 0.98 | 0.74 | 0.06 | 0.00 | | tier 4 (32-bit cell) | 0.99 | 0.99 | 0.86 | 0.04 | 0.02 | 0.00 | | tier 5 (64-bit cell) | 0.98 | 0.95 | 0.65 | 0.03 | 0.01 | 0.00 | | tier 6 (128-bit TCN) | 0.97 | 0.96 | 0.98 | 0.19 | 0.02 | 0.00 | Generalisation against memorisation: 10% of primes at each bit-width were held out of training entirely; chain accuracy on them matches the training primes. ## Training Single-step examples `(t, bit, b, p) → (2t + bit·b) mod p` over each tier's prime range; BCE per state bit, AdamW + cosine decay + EMA, checkpointed by full-chain accuracy on held-out primes. The 64-bit cell adds a second fine-tuning phase whose single steps are drawn from the **true Horner trajectory** — `t` is an actual chain intermediate `(a_{≥i}·b) mod p`, not a uniform sample — matching the training distribution to the states the chain visits at inference. This lifts tier 5 from 0.74 to 0.98 with no capacity change and no backprop through the recurrence (ordinary supervised BCE on the same single-step target). The 128-bit (tier-6) cell is trained the same single-step way but as the **carry-aware TCN** over a high-diversity pool of thousands of distinct 124–128 bit primes; its weight-shared dilated-convolution bias reaches a per-step error ~15× lower than the same-task MLP, giving **tier 6 = 0.97** in a single short run. Training code and the full write-up live in the solutions repo (link in the model card metadata / challenge leaderboard). ## License Apache-2.0, matching the challenge.