---
license: apache-2.0
library_name: pytorch
tags:
  - modular-arithmetic
  - algorithmic-reasoning
  - rnn
  - number-theory
  - neural-algorithm
---

# Horner-RNN — learned modular multiplication up to 2¹²⁸

A compliant **bit-sequential RNN** that computes `(a · b) mod p` for primes `p` up to
**2¹²⁸**, by *learning the Horner step of double-and-add* rather than memorising
multiplication tables. Entry for the
[Modular Arithmetic Challenge](https://github.com/SAIRcompetition/modular-arithmetic-challenge).

- **Saturates tiers 1–6** (all primes `< 2¹²⁸`): tiers 1–3 = 100%, tier 4 = 99%, tier 5 = 98%, **tier 6 = 97%**
- **overall_accuracy 0.602**, `highest_tier_above_90 = 6`
- The 128-bit (tier-6) cell is a **carry-aware TCN** (weight-shared dilated convolutions over the
  128 bit-positions, ~3.9M params) — a far better inductive bias for long carry chains than the MLP
- Verifiably **generalises to primes never seen in training** (held-out-prime validation
  accuracy tracks training accuracy — no memorisation gap)

## The idea

Write `a` in bits, MSB-first; then `a·b mod p` is the iterate of one small map:

```
t_0 = 0
t_{k+1} = (2·t_k + a_bit_k · b) mod p      # one learned step (Horner)
answer  = t_N           (N = bit width of p)
```

The model is an RNN whose **transition function — an MLP — is trained on exactly that
single-step map** over binary-encoded inputs. The hidden state is a quantized bit vector
(a hard binary bottleneck), so the recurrence composes cleanly: if the cell is exact per
step, the chain is exact end-to-end. At inference the scan feeds the bits of `a mod p` one
per step, conditioned on `(b mod p, p)`, and the final hidden-state bits are emitted
MSB-first as the base-2 answer (`output_base: 2`).

The single-step function is **piecewise linear** (`2t + bit·b`, then subtract `0`, `p`, or
`2p`), which is why it generalises across primes where the full bilinear map
`(a,b) → a·b mod p` does not.

## Files / cells

The model ships **four cells** and routes each problem to the narrowest one whose state
holds the prime:

| File | Cell | Primes | Tiers | Arch | Params | Public benchmark |
|---|---|---|---|---|---|---|
| `weights16.pt` | 16-bit | `< 2¹⁶` | 1–3 | MLP, 4096 / 4 | ~50M | tiers 1–3 = 1.00 |
| `weights32.pt` | 32-bit | `< 2³²` | 4 | MLP, 6144 / 4 | ~114M | tier 4 = 0.99 |
| `weights64.pt` | 64-bit | `< 2⁶⁴` | 5 | MLP, 4096 / 7, residual | ~236M | tier 5 = 0.98 |
| `weights128.pt` | 128-bit | `< 2¹²⁸` | 6 | **carry-aware TCN**, 256ch / 10 blocks, dilations 1–64 | ~3.9M | **tier 6 = 0.97** |

The 128-bit cell switches architecture: instead of a full-width MLP it is a **non-causal
dilated 1-D convolutional network over the 128 bit-positions**. Carry propagation is
*position-invariant* — the same carry/borrow rule applies at every bit — so a weight-shared
convolution learns **one** rule applied everywhere (non-causal, so the addition carry flows
LSB→MSB and the mod-`p` compare/borrow flows MSB→LSB), rather than an MLP learning 128 separate
position-functions. This inductive bias drives the per-step error roughly **15× lower** than the
same-task MLP, lifting tier 6 from 0.26 to **0.97** with a cell **~60× smaller** (16 MB vs ~950 MB).

The 64-bit cell needs **depth and residual connections** the narrower cells do not: a 64-bit
modular Horner step hides two long carry chains (the `2t + bit·b` addition and the
compare-and-subtract reduction), and exact n-bit carry propagation wants MLP depth ~log₂(n).
The last push from tier 5 = 0.74 to 0.98 came from training the 64-bit cell's single-step
examples on the **states the chain actually visits** (the true Horner trajectory) rather than
uniformly sampled `t` — see *Training*. For `p ≥ 2⁶⁴` the model emits the honest `[0]`
fallback without invoking the network.

Also in the repo: `model.py` (the `HornerRNN` entry class + `HornerCell`), `manifest.json`
(challenge manifest), `train.py` (the 16-bit trainer).

## Usage

This is a challenge submission; the base class lives in the challenge package, so install it
first:

```bash
pip install "git+https://github.com/SAIRcompetition/modular-arithmetic-challenge"
```

Direct inference:

```python
import torch
from model import HornerRNN          # model.py from this repo

m = HornerRNN()
m.load(".")                          # auto-loads weights{16,32,64}.pt from this dir
# returns base-2 digits, MSB-first; the harness decodes them to the integer
digits = m.predict_digits_batch([(123456789, 987654321, 4294967291)])[0]
answer = int("".join(map(str, digits)), 2)
print(answer)                        # == (123456789 * 987654321) % 4294967291
```

Or score it with the official harness:

```bash
modchallenge evaluate . --total 1100
```

## Compliance (the rules permit *learned* algorithms, not hand-coded ones)

The **scan** (tokenise `a mod p` into bits, iterate, read out the final state) is
architecture — it computes nothing by itself. The **arithmetic** (doubling, conditional
add, compare-against-`p`, carries) all lives in the trained cell weights; nothing in the
code adds, multiplies, or compares against `p`.

**Principle 2, measured** — perturbing the cell weights with Gaussian noise scaled to each
tensor's std collapses accuracy toward the floor, and a fully re-initialised (untrained)
cell is *at* the floor. The capability therefore resides in the trained parameters:

| noise σ (×param std) | 0 | 0.05 | 0.1 | 0.25 | 0.5 | untrained |
|---|---|---|---|---|---|---|
| tier 3 (16-bit cell) | 1.00 | 1.00 | 0.98 | 0.74 | 0.06 | 0.00 |
| tier 4 (32-bit cell) | 0.99 | 0.99 | 0.86 | 0.04 | 0.02 | 0.00 |
| tier 5 (64-bit cell) | 0.98 | 0.95 | 0.65 | 0.03 | 0.01 | 0.00 |
| tier 6 (128-bit TCN) | 0.97 | 0.96 | 0.98 | 0.19 | 0.02 | 0.00 |

Generalisation against memorisation: 10% of primes at each bit-width were held out of
training entirely; chain accuracy on them matches the training primes.

## Training

Single-step examples `(t, bit, b, p) → (2t + bit·b) mod p` over each tier's prime range; BCE
per state bit, AdamW + cosine decay + EMA, checkpointed by full-chain accuracy on held-out
primes. The 64-bit cell adds a second fine-tuning phase whose single steps are drawn from the
**true Horner trajectory** — `t` is an actual chain intermediate `(a_{≥i}·b) mod p`, not a
uniform sample — matching the training distribution to the states the chain visits at
inference. This lifts tier 5 from 0.74 to 0.98 with no capacity change and no backprop through
the recurrence (ordinary supervised BCE on the same single-step target). The 128-bit (tier-6)
cell is trained the same single-step way but as the **carry-aware TCN** over a high-diversity
pool of thousands of distinct 124–128 bit primes; its weight-shared dilated-convolution bias
reaches a per-step error ~15× lower than the same-task MLP, giving **tier 6 = 0.97** in a single
short run. Training code and the full write-up live in the solutions repo (link in the model card
metadata / challenge leaderboard).

## License

Apache-2.0, matching the challenge.