---
license: apache-2.0
language: [en]
tags: [text-generation, small-models, mla, jepa, experimental]
pipeline_tag: text-generation
library_name: transformers
---

# Byrne-86M-Base

This is the **base** model of the Byrne family (the distilled step-4000 checkpoint), intended as
a general-purpose starting point for continued pretraining or fine-tuning. It is a
~86M-parameter, from-scratch `SpikeWhaleLM` decoder (Multi-head Latent Attention, n-gram engram
memory, hash-lookup layers, hyper-connections, HRM refinement, MTP) with a custom ChatML-aware
tokenizer. It was trained with **Modal** credits during the **Small Models, Big Adventures
Hackathon**.

> **Related:** main model → [Byrne-86M](https://huggingface.co/Quazim0t0/Byrne-86M)

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("Quazim0t0/Byrne-86M-Base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Quazim0t0/Byrne-86M-Base", trust_remote_code=True)
```

<!-- ARCH_TOK_START -->
## Architecture

These models are built on **SpikeWhaleLM**, a custom ~86M-parameter decoder-only transformer
(16 layers, hidden size 640, 4096-token context, 16,512 vocab, tied input/output embeddings).
It combines several non-standard components:

- **Multi-head Latent Attention (MLA + XSA)** — queries and the output projection are
  LoRA-compressed (rank 128); each head splits into a decoupled RoPE part (dim 16) and a
  position-agnostic NoPE part (dim 48); 10 query heads share a **single KV head**
  (multi-query attention), with QK-norm for stable logits.
- **Engram n-gram memory** — a gated associative memory that hashes local n-grams (up to
  trigrams) into a learned 4,096-entry table and mixes the result back into the residual stream.
- **Hash-lookup layers (×2)** — multi-head content-addressable features alongside the token
  embeddings.
- **Hyper-Connections** — learned, width-expanded residual connections mixed via
  Sinkhorn-normalized routing, in place of the plain residual add.
- **HRM refinement** — a Hierarchical Reasoning Model block that performs an extra latent
  "think a bit more" refinement pass over the hidden states before the output head.
- **Multi-Token Prediction (MTP)** — a DeepSeek-V3-style auxiliary training head predicting
  more than one next token (no inference cost).
- Feed-forward is **dense** (the block is MoE-capable, but MoE is disabled in this release).
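
To make the Engram bullet above more concrete, here is a minimal sketch of how local n-grams could be hashed into a fixed-size table. Only the table size (4,096) and the trigram limit come from this card. The hash function, the choice to start at bigrams, and the names `ngram_bucket` and `engram_buckets` are assumptions made for illustration, and the learned table and gating are not shown.

```python
from typing import List, Tuple

TABLE_SIZE = 4096  # from this card: learned 4,096-entry table
MAX_N = 3          # from this card: n-grams up to trigrams


def ngram_bucket(ngram: Tuple[int, ...], table_size: int = TABLE_SIZE) -> int:
    """Map an n-gram of token ids to a table slot.

    Hypothetical polynomial hash; the hash used by the model is not
    documented in this card.
    """
    h = 0
    for tok in ngram:
        h = (h * 1_000_003 + tok + 1) % (2**61 - 1)
    return h % table_size


def engram_buckets(token_ids: List[int], max_n: int = MAX_N) -> List[List[int]]:
    """For each position, list the slots of the n-grams (n = 2..max_n) ending there.

    Assumption: lookups start at bigrams. Positions without enough left
    context simply get fewer lookups.
    """
    out: List[List[int]] = []
    for i in range(len(token_ids)):
        slots = []
        for n in range(2, max_n + 1):
            start = i - n + 1
            if start >= 0:
                slots.append(ngram_bucket(tuple(token_ids[start : i + 1])))
        out.append(slots)
    return out


if __name__ == "__main__":
    # The bigram (5, 9) occurs twice, so it maps to the same slot both times.
    print(engram_buckets([5, 9, 5, 9]))
```

In a real layer, each slot would index a learned embedding, and a gate would decide how much of the looked-up vector is added back to the residual stream.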

> **JEPA vs HRM.** The **Byrne** models are **Non-JEPA**: they are trained with **HRM refinement only** (`use_hrm_refine=True`, `use_jepa=False`). The sibling **Escarda** models add a **JEPA** (Joint-Embedding Predictive) auxiliary objective on top of HRM refinement.
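
As a quick consistency check on the sizes quoted in this section, the arithmetic below uses only numbers stated above. It assumes that a head's width is the sum of its RoPE and NoPE parts, and it treats the ~86M parameter count as a nominal figure, so the embedding share is approximate.

```python
HIDDEN_SIZE = 640        # from this card
NUM_QUERY_HEADS = 10     # from this card
ROPE_DIM = 16            # decoupled RoPE part of each head
NOPE_DIM = 48            # position-agnostic part of each head
VOCAB_SIZE = 16_512      # from this card
NOMINAL_PARAMS = 86e6    # "~86M", approximate

# Assumption: per-head width is the sum of the two parts.
head_dim = ROPE_DIM + NOPE_DIM
attn_width = NUM_QUERY_HEADS * head_dim

# Tied input/output embeddings are stored once.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE
embedding_share = embedding_params / NOMINAL_PARAMS

print(f"head_dim={head_dim}, attn_width={attn_width}")
print(f"embedding params={embedding_params:,} (~{embedding_share:.0%} of the nominal total)")
```

Under these assumptions the ten query heads span exactly the hidden size, and the tied embedding matrix accounts for roughly an eighth of the parameter budget.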

## Tokenizer

These models use **`SpikeTokenizer`**, a custom **byte-level "length-max" (greedy
longest-match)** tokenizer with a **16,512-token vocabulary** — not a standard BPE/HF
tokenizer. Text is UTF-8 encoded, each byte mapped to a latin-1 character, then greedily
matched against the vocab using the longest key that fits at each position. It is
**ChatML-aware**, with atomic special tokens for framing and reasoning/tool markers
(`<|im_start|>`, `<|im_end|>`, `<think>`/`</think>`, `<begin_solution>`/`<end_solution>`,
tool-call markers) plus `<bos>`/`<eos>`/`<pad>`/`<unk>`. It ships as a `PreTrainedTokenizer`
subclass (`spike_tokenizer.py`) and loads via
`AutoTokenizer.from_pretrained(..., trust_remote_code=True)`.
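
The matching procedure described above can be sketched in a few lines. This is an illustrative toy, not the code in `spike_tokenizer.py`: the vocabulary, the ids, and the function names are invented, and atomic special tokens are not handled.

```python
from typing import Dict, List


def bytes_to_latin1(text: str) -> str:
    """UTF-8 encode, then map each byte to the latin-1 character with the same code."""
    return text.encode("utf-8").decode("latin-1")


def greedy_longest_match(text: str, vocab: Dict[str, int], unk_id: int) -> List[int]:
    """At each position, emit the id of the longest vocab key that matches."""
    s = bytes_to_latin1(text)
    max_len = max(len(key) for key in vocab)
    ids: List[int] = []
    i = 0
    while i < len(s):
        # Try the longest candidate first, then shrink.
        for length in range(min(max_len, len(s) - i), 0, -1):
            piece = s[i : i + length]
            if piece in vocab:
                ids.append(vocab[piece])
                i += length
                break
        else:
            # No key matches even a single byte: fall back to the unknown id.
            ids.append(unk_id)
            i += 1
    return ids


# Toy vocabulary (hypothetical). "\u00c3\u00a9" is the latin-1 view of the
# two UTF-8 bytes of "é".
TOY_VOCAB = {"h": 1, "e": 2, "l": 3, "o": 4, "he": 5, "hell": 6, "lo": 7, "\u00c3\u00a9": 8}
UNK_ID = 0

if __name__ == "__main__":
    print(greedy_longest_match("hello", TOY_VOCAB, UNK_ID))
    print(greedy_longest_match("héllo", TOY_VOCAB, UNK_ID))
```

Because the match is greedy, `"hello"` is split as `hell` + `o` here, even though `he` + `l` + `lo` would also cover the string.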
<!-- ARCH_TOK_END -->

## Evaluation

Accuracy scores (`acc` = raw log-likelihood, `acc_norm` = byte-length-normalized). A dash in
the table means no `acc_norm` is reported for that task.

| Task | acc | acc_norm |
|---|---|---|
| arc_easy | 0.4205 | 0.3931 |
| arc_challenge | 0.1877 | 0.2389 |
| hellaswag | 0.2792 | 0.2927 |
| winogrande | 0.5193 | — |
| piqa | 0.5941 | 0.5860 |
| openbookqa | 0.1420 | 0.2820 |
| boolq | 0.6171 | — |
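
The difference between the two columns can be illustrated with a small sketch. It assumes that byte-length normalization means dividing each candidate's summed log-likelihood by its UTF-8 byte length before taking the argmax. The function name and the log-likelihood values are invented for illustration.

```python
from typing import List, Tuple


def pick_answers(choices: List[str], loglikelihoods: List[float]) -> Tuple[int, int]:
    """Return (index chosen by raw log-likelihood, index chosen after byte-length normalization)."""
    indices = range(len(choices))
    raw = max(indices, key=lambda i: loglikelihoods[i])
    norm = max(
        indices,
        key=lambda i: loglikelihoods[i] / len(choices[i].encode("utf-8")),
    )
    return raw, norm


if __name__ == "__main__":
    # Hypothetical scores: the short answer wins on raw log-likelihood,
    # the long answer wins once scores are divided by byte length.
    choices = ["yes", "a much longer answer"]
    scores = [-4.0, -10.0]
    print(pick_answers(choices, scores))
```

Raw scoring tends to favor short answers, which is why `acc` and `acc_norm` can disagree noticeably on tasks whose options vary in length.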

**ArithMark-2.0** ([AxiomicLabs](https://huggingface.co/datasets/AxiomicLabs/ArithMark-2.0))
— official metric is raw **`acc`**: **0.2732**.

**Language modeling:** WikiText-2 byte_ppl (↓) **2.3753** · BLiMP (↑) **0.7356**.
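
For readers who prefer bits per byte, the byte perplexity above can be converted directly. The sketch assumes `byte_ppl` is the exponential of the mean negative log-likelihood (in nats) per UTF-8 byte, which is the usual definition. The helper names are illustrative.

```python
import math


def byte_perplexity(total_nll_nats: float, num_bytes: int) -> float:
    """exp of the average negative log-likelihood per byte (assumed definition)."""
    return math.exp(total_nll_nats / num_bytes)


def bits_per_byte(byte_ppl: float) -> float:
    """Convert byte perplexity to bits per byte."""
    return math.log2(byte_ppl)


REPORTED_BYTE_PPL = 2.3753  # WikiText-2 value from this card
bpb = bits_per_byte(REPORTED_BYTE_PPL)
print(f"bits per byte: {bpb:.3f}")
```

Under that assumption, the reported value corresponds to about 1.25 bits per byte.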

<!-- CITE_START -->
## Citation

If you use this model, please cite:

```bibtex
@misc{byrne86mbase,
  title        = {Byrne-86M-Base: A ~86M-parameter SpikeWhaleLM},
  author       = {Dean Byrne (Quazim0t0)},
  year         = {2026},
  howpublished = {HuggingFace, \url{https://huggingface.co/Quazim0t0/Byrne-86M-Base}},
  note         = {Quazim0t0/Byrne-86M-Base}
}
```
<!-- CITE_END -->