---
license: apache-2.0
language: [en]
tags: [text-generation, small-models, mla, jepa, experimental]
pipeline_tag: text-generation
library_name: transformers
---
# Byrne-86M-Base
The **base** model of the Byrne family (distilled step-4000 checkpoint), intended as a general-purpose starting point for continued pretraining and fine-tuning. It is a ~86M-parameter, from-scratch `SpikeWhaleLM` decoder (Multi-head Latent Attention,
n-gram engram memory, hash-lookup layers, hyper-connections, HRM refinement, MTP) with a
custom ChatML-aware tokenizer. It was trained with **Modal** credits during the **Small Models,
Big Adventures Hackathon**.
> **Related:** main model → [Byrne-86M](https://huggingface.co/Quazim0t0/Byrne-86M)
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("Quazim0t0/Byrne-86M-Base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Quazim0t0/Byrne-86M-Base", trust_remote_code=True)
```
## Architecture
These models are built on **SpikeWhaleLM**, a custom ~86M-parameter decoder-only transformer
(16 layers, hidden size 640, 4096-token context, 16,512 vocab, tied input/output embeddings).
It combines several non-standard components:
- **Multi-head Latent Attention (MLA + XSA)** — queries and the output projection are
LoRA-compressed (rank 128); each head splits into a decoupled RoPE part (dim 16) and a
position-agnostic NoPE part (dim 48); 10 query heads share a **single KV head**
(multi-query attention), with QK-norm for stable logits.
- **Engram n-gram memory** — a gated associative memory that hashes local n-grams (up to
trigrams) into a learned 4,096-entry table and mixes the result back into the residual stream.
- **Hash-lookup layers (×2)** — multi-head content-addressable features alongside the token
embeddings.
- **Hyper-Connections** — learned, width-expanded residual connections mixed via
Sinkhorn-normalized routing, in place of the plain residual add.
- **HRM refinement** — a Hierarchical Reasoning Model block that performs an extra latent
"think a bit more" refinement pass over the hidden states before the output head.
- **Multi-Token Prediction (MTP)** — a DeepSeek-V3-style auxiliary training head predicting
more than one next token (no inference cost).
- Feed-forward is **dense** (the block is MoE-capable, but MoE is disabled in this release).
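
As a rough illustration of the Engram lookup described in the list above, the sketch below maps the bigram and trigram ending at each position to a slot in a 4,096-entry table. The hash function, the choice of n = 2 and 3, and the names `ngram_slot` and `engram_slots` are assumptions made for illustration. The actual SpikeWhaleLM hashing, the learned table contents, and the gating into the residual stream are not shown.

```python
TABLE_SIZE = 4096   # table size stated in the card
NGRAM_ORDERS = (2, 3)  # assumption: bigrams and trigrams ("up to trigrams")


def ngram_slot(ngram, table_size=TABLE_SIZE):
    """Hash a tuple of token ids to a table slot.

    Hypothetical polynomial hash, not the model's real hash function.
    """
    h = 0
    for tok in ngram:
        h = (h * 1000003 + tok + 1) % (2**61 - 1)
    return h % table_size


def engram_slots(token_ids, orders=NGRAM_ORDERS):
    """For each position, list the slots of the n-grams that end there."""
    slots = []
    for i in range(len(token_ids)):
        here = []
        for n in orders:
            start = i - n + 1
            if start >= 0:  # skip n-grams that would begin before the sequence
                here.append(ngram_slot(tuple(token_ids[start : i + 1])))
        slots.append(here)
    return slots


# Toy token ids: the bigram (5, 9) occurs twice and lands in the same slot.
example = engram_slots([5, 9, 5, 9])
```

In a real layer, each slot would index a learned embedding row, and a gate would decide how much of that row is added back to the hidden state at that position.
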
> **JEPA vs HRM.** The **Byrne** models are **Non-JEPA**: they are trained with **HRM refinement only** (`use_hrm_refine=True`, `use_jepa=False`). The sibling **Escarda** models add a **JEPA** (Joint-Embedding Predictive) auxiliary objective on top of HRM refinement.
## Tokenizer
These models use **`SpikeTokenizer`**, a custom **byte-level "length-max" (greedy
longest-match)** tokenizer with a **16,512-token vocabulary** — not a standard BPE/HF
tokenizer. Text is UTF-8 encoded, each byte mapped to a latin-1 character, then greedily
matched against the vocab using the longest key that fits at each position. It is
**ChatML-aware**, with atomic special tokens for chat framing (`<|im_start|>`, `<|im_end|>`),
reasoning and tool-call markers, and a few other special tokens. It ships as a `PreTrainedTokenizer`
subclass (`spike_tokenizer.py`) and loads via
`AutoTokenizer.from_pretrained(..., trust_remote_code=True)`.
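
The byte-to-latin-1 mapping and greedy longest-match step described above can be sketched in a few lines. This is a minimal sketch, not the code in `spike_tokenizer.py`: the toy vocabulary, the ids, and the names `greedy_encode` and `greedy_decode` are hypothetical, and atomic special-token handling is left out.

```python
# Toy vocab (hypothetical): one entry per byte value plus a few longer pieces.
TOY_VOCAB = {chr(b): b for b in range(256)}
TOY_VOCAB.update({"th": 256, "the": 257, " cat": 258})


def greedy_encode(text, vocab):
    """Encode text by taking the longest vocab key at each position."""
    # UTF-8 bytes viewed as latin-1 characters: one character per byte.
    s = text.encode("utf-8").decode("latin-1")
    max_len = max(len(key) for key in vocab)
    ids = []
    i = 0
    while i < len(s):
        # Try the longest candidate first, then shrink.
        for length in range(min(max_len, len(s) - i), 0, -1):
            piece = s[i : i + length]
            if piece in vocab:
                ids.append(vocab[piece])
                i += length
                break
        else:
            raise ValueError(f"no vocab entry covers position {i}")
    return ids


def greedy_decode(ids, vocab):
    """Invert the mapping: ids -> latin-1 string -> bytes -> UTF-8 text."""
    inverse = {idx: piece for piece, idx in vocab.items()}
    return "".join(inverse[i] for i in ids).encode("latin-1").decode("utf-8")


ids = greedy_encode("the cat", TOY_VOCAB)
```

Because every single byte has its own entry in this toy vocabulary, any UTF-8 input can be encoded; longer keys only shorten the id sequence.
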
## Evaluation
Scores are computed by log-likelihood; `acc` is raw accuracy and `acc_norm` is byte-length-normalized accuracy.
| Task | acc | acc_norm |
|---|---|---|
| arc_easy | 0.4205 | 0.3931 |
| arc_challenge | 0.1877 | 0.2389 |
| hellaswag | 0.2792 | 0.2927 |
| winogrande | 0.5193 | — |
| piqa | 0.5941 | 0.5860 |
| openbookqa | 0.1420 | 0.2820 |
| boolq | 0.6171 | — |
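
To make the difference between the two columns concrete, here is a small sketch of how a multiple-choice prediction can change once each answer's log-likelihood is divided by its UTF-8 byte length. The log-likelihood values, the item layout, and the function names are made up for illustration; they are not taken from the evaluation run reported here.

```python
def predict(logliks, choices, normalize=False):
    """Return the index of the best-scoring choice.

    With normalize=True, each log-likelihood is divided by the choice's
    UTF-8 byte length, which removes the penalty on longer answers.
    """
    scores = []
    for ll, choice in zip(logliks, choices):
        n_bytes = len(choice.encode("utf-8"))
        scores.append(ll / n_bytes if normalize else ll)
    return max(range(len(scores)), key=lambda i: scores[i])


def accuracy(items, normalize=False):
    """Fraction of items where the predicted index matches the gold index."""
    hits = sum(
        predict(it["logliks"], it["choices"], normalize) == it["gold"]
        for it in items
    )
    return hits / len(items)


# Hypothetical item: the long answer is correct but has a lower raw score.
items = [
    {"choices": ["yes", "a much longer answer"], "logliks": [-4.0, -10.0], "gold": 1}
]
acc = accuracy(items)                       # raw log-likelihood
acc_norm = accuracy(items, normalize=True)  # byte-length-normalized
```

Normalization can move a score in either direction, which is consistent with `acc_norm` being higher than `acc` on some rows of the table and lower on others.
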
**ArithMark-2.0** ([AxiomicLabs](https://huggingface.co/datasets/AxiomicLabs/ArithMark-2.0))
— official metric is raw **`acc`**: **0.2732**.
**Language modeling:** WikiText-2 byte_ppl (↓) **2.3753** · BLiMP (↑) **0.7356**.
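
For readers unfamiliar with byte-level perplexity, the sketch below shows the usual definition: the exponential of the negative total log-likelihood divided by the total number of UTF-8 bytes. It assumes natural-log likelihoods summed per document; the function name and the toy numbers are illustrative and not part of the reported evaluation.

```python
import math


def byte_perplexity(doc_logliks, doc_texts):
    """exp(-total log-likelihood / total UTF-8 bytes), using natural logs."""
    total_ll = sum(doc_logliks)
    total_bytes = sum(len(text.encode("utf-8")) for text in doc_texts)
    return math.exp(-total_ll / total_bytes)


# Toy check: a 10-byte document scored at exactly ln(2) nats per byte.
toy_ppl = byte_perplexity([-10 * math.log(2)], ["abcdefghij"])
```

Under this definition, a byte perplexity of 2.3753 corresponds to roughly 1.25 bits per byte (the base-2 logarithm of the perplexity).
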
## Citation
If you use this model, please cite:
```bibtex
@misc{byrne86mbase,
title = {Byrne-86M-Base: A ~86M-parameter SpikeWhaleLM},
author = {Dean Byrne (Quazim0t0)},
year = {2026},
howpublished = {HuggingFace, \url{https://huggingface.co/Quazim0t0/Byrne-86M-Base}},
note = {Quazim0t0/Byrne-86M-Base}
}
```