{
  "entry_class": "model.HornerRNN",
  "output_base": 2,
  "framework": "pytorch",
  "model_description": "Bit-sequential RNN (~410M params across five cells) for primes up to 2^256. Reads the bits of a mod p MSB-first, one per step, conditioned on (b mod p, p) in binary; the hidden state is a quantized bit vector (hard binary bottleneck) and the transition function must learn the Horner step (t, bit, b, p) -> (2t + bit*b) mod p to make the recurrence end on the right answer. Five cells are shipped and routed by prime size: a 16-bit cell (MLP, width 4096 depth 4, ~50M params) for p < 2^16 covering tiers 1-3, a 32-bit cell (MLP, width 6144 depth 4, ~114M params) for p < 2^32 covering tier 4, a 64-bit cell (MLP, width 4096 depth 7 with pre-norm residual blocks, ~236M params) for p < 2^64 covering tier 5, a 128-bit cell for p < 2^128 covering tier 6 that is a CARRY-AWARE TCN: a non-causal dilated 1D-convolutional network over the 128 bit-positions (10 residual blocks, 256 channels, dilations cycling 1..64 so the receptive field spans all 128 bits, ~3.9M params), and a 256-bit cell for p < 2^256 covering tier 7 that uses the SAME carry-aware TCN architecture scaled to 256 bit-positions (12 residual blocks, 256 channels, dilations cycling 1..128 so the receptive field spans all 256 bits, ~4.7M params), lifting tier 7 to 0.98. The convolution is weight-shared across bit positions, so it learns ONE carry/borrow rule applied everywhere (non-causally, so the addition carry can flow LSB->MSB and the mod-p compare/borrow MSB->LSB) instead of a full-width MLP learning a separate position-function per bit; this inductive bias drives the per-step error far below what an MLP cell reaches and is what makes the 128- and 256-bit chains (which compound the per-step error over 128 and 256 steps) accurate. Final state bits are emitted MSB-first as the base-2 answer. For p >= 2^256 emits the honest [0] fallback without invoking the network.",
  "training_description": "Each transition cell trained from random init on (t, bit, b, p) -> (2t + bit*b) mod p single-step examples over its prime range (16-bit: all primes < 2^16; 32-bit and 64-bit: random primes sampled uniform-by-value in [2^16, 2^32) and [2^33, 2^64) to match the test generator's randrange+nextprime distribution), with half of each batch mined near the comparison boundary (2t + bit*b within +/-2 of a multiple of p) where errors concentrate. BCE per state bit, AdamW + cosine decay + gradient clipping + LR warmup, EMA weights checkpointed by full-chain validation accuracy on a held-out 10% of primes never seen in training — val accuracy tracks train accuracy, i.e. the cells generalise across primes rather than memorising them. The 64-bit cell additionally receives a second fine-tuning phase on single steps drawn from the TRUE Horner trajectory (each example is a (t, bit, b, p) -> (2t + bit*b) mod p step where t is an actual chain intermediate (a_{>=i}*b) mod p, not a uniform sample), which matches the training distribution to the states the chain visits at inference and lifts tier 5 from 0.74 to 0.98; still ordinary supervised BCE on the same single-step target, no backprop through the recurrence. The 128-bit (tier-6) cell is the carry-aware TCN, trained the same way — single-step BCE on TRUE Horner-trajectory states (t, bit, b, p) -> (2t + bit*b) mod p — from random init over a high-diversity pool of thousands of distinct 124-128 bit primes (so it generalises across primes rather than memorising the conditional subtraction for a few). Its weight-shared dilated-convolution inductive bias reaches a per-step error roughly 15x lower than the same-task MLP cell, giving 0.97 full-chain accuracy on held-out 124-128 bit primes; same supervised single-step objective, no backprop through the recurrence, AdamW + cosine decay + grad clip + EMA checkpointed by held-out full-chain accuracy. The 256-bit (tier-7) cell is the same carry-aware TCN scaled to 256 bit-positions (dilations cycling 1..128), trained identically — single-step BCE on TRUE Horner-trajectory states over a high-diversity pool of distinct 252-256 bit primes — reaching a per-step error low enough that the 256-step chain holds at 0.98 full-chain accuracy on held-out 252-256 bit primes. Weight-perturbation compliance (exploration/compliance_perturb.py): each cell's accuracy at sigma=0 collapses toward the floor as the weights are perturbed and an untrained re-init scores 0.00 — e.g. tier 6 0.97 -> 0.19 (sigma=0.25) -> 0.02 (sigma=0.5) and tier 7 0.98 -> 0.06 (sigma=0.25) -> 0.02 (sigma=0.5), untrained 0.00 for both — so the arithmetic resides in the trained parameters. Training scripts: train.py (16-bit), exploration/train_horner32.py (32-bit), exploration/train_horner64.py (64-bit phase 1, --residual) then exploration/train_horner64_traj.py (64-bit phase 2, trajectory), exploration/train_horner128_bigru.py --arch tcn (128-bit carry-aware TCN), exploration/train_horner_tcn.py --bits 256 (256-bit carry-aware TCN)."
}