---
license: apache-2.0
language: [en]
tags: [text-generation, small-models, mla, jepa, experimental]
pipeline_tag: text-generation
library_name: transformers
---

# Byrne-86M-Base

The **base** model of the Byrne family (distilled step-4000 checkpoint): a strong general base for continued pretraining / fine-tuning.

A ~86M-parameter, from-scratch `SpikeWhaleLM` decoder (Multi-head Latent Attention, n-gram engram memory, hash-lookup layers, hyper-connections, HRM refinement, MTP) with a custom ChatML-aware tokenizer.

Trained with **Modal** credits during the **Small Models, Big Adventures Hackathon**.

> **Related:** main model → [Byrne-86M](https://huggingface.co/Quazim0t0/Byrne-86M)

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Quazim0t0/Byrne-86M-Base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Quazim0t0/Byrne-86M-Base", trust_remote_code=True)
```

## Architecture

These models are built on **SpikeWhaleLM**, a custom ~86M-parameter decoder-only transformer (16 layers, hidden size 640, 4096-token context, 16,512 vocab, tied input/output embeddings). It combines several non-standard components:

- **Multi-head Latent Attention (MLA + XSA)**: queries and the output projection are LoRA-compressed (rank 128); each head splits into a decoupled RoPE part (dim 16) and a position-agnostic NoPE part (dim 48); 10 query heads share a **single KV head** (multi-query attention), with QK-norm for stable logits.
- **Engram n-gram memory**: a gated associative memory that hashes local n-grams (up to trigrams) into a learned 4,096-entry table and mixes the result back into the residual stream.
- **Hash-lookup layers (×2)**: multi-head content-addressable features alongside the token embeddings.
- **Hyper-Connections**: learned, width-expanded residual connections mixed via Sinkhorn-normalized routing, in place of the plain residual add.
- **HRM refinement**: a Hierarchical Reasoning Model block that performs an extra latent "think a bit more" refinement pass over the hidden states before the output head.
- **Multi-Token Prediction (MTP)**: a DeepSeek-V3-style auxiliary training head predicting more than one next token (no inference cost).
- Feed-forward is **dense** (the block is MoE-capable, but MoE is disabled in this release).

> **JEPA vs HRM.** The **Byrne** models are **Non-JEPA**: they are trained with **HRM refinement only** (`use_hrm_refine=True`, `use_jepa=False`). The sibling **Escarda** models add a **JEPA** (Joint-Embedding Predictive) auxiliary objective on top of HRM refinement.

## Tokenizer

These models use **`SpikeTokenizer`**, a custom **byte-level "length-max" (greedy longest-match)** tokenizer with a **16,512-token vocabulary**, not a standard BPE/HF tokenizer. Text is UTF-8 encoded, each byte is mapped to a latin-1 character, and the result is greedily matched against the vocab using the longest key that fits at each position. It is **ChatML-aware**, with atomic special tokens for framing and reasoning/tool markers (`<|im_start|>`, `<|im_end|>`, ``/``, ``/``, tool-call markers) plus ``/``/``/``. It ships as a `PreTrainedTokenizer` subclass (`spike_tokenizer.py`) and loads via `AutoTokenizer.from_pretrained(..., trust_remote_code=True)`.

## Evaluation

log-likelihood, `acc_norm` = byte-length-normalized).

| Task | acc | acc_norm |
|---|---|---|
| arc_easy | 0.4205 | 0.3931 |
| arc_challenge | 0.1877 | 0.2389 |
| hellaswag | 0.2792 | 0.2927 |
| winogrande | 0.5193 | n/a |
| piqa | 0.5941 | 0.5860 |
| openbookqa | 0.1420 | 0.2820 |
| boolq | 0.6171 | n/a |

**ArithMark-2.0** ([AxiomicLabs](https://huggingface.co/datasets/AxiomicLabs/ArithMark-2.0)): the official metric is raw **`acc`**, **0.2732**.

**Language modeling:** WikiText-2 byte_ppl (↓) **2.3753** · BLiMP (↑) **0.7356**.
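
The `acc` and `acc_norm` columns in the table above can diverge because they rank answer choices differently. The sketch below illustrates one common convention, which is an assumption here and not confirmed by this card: `acc` picks the choice with the highest summed log-likelihood, while `acc_norm` divides that sum by the UTF-8 byte length of the choice first. The function name `pick_choice` and the example numbers are hypothetical and not taken from the actual evaluation run.

```python
def pick_choice(loglikelihoods, choices, normalize=False):
    """Return the index of the best-scoring answer choice.

    loglikelihoods: summed log-probability of each choice's continuation.
    choices: the choice strings, in the same order.
    normalize: if True, divide each score by the choice's UTF-8 byte length
               (the assumed meaning of "byte-length-normalized").
    """
    if len(loglikelihoods) != len(choices):
        raise ValueError("need exactly one log-likelihood per choice")
    scores = []
    for ll, text in zip(loglikelihoods, choices):
        if normalize:
            n_bytes = max(len(text.encode("utf-8")), 1)  # guard against empty choices
            scores.append(ll / n_bytes)
        else:
            scores.append(ll)
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up numbers: the short answer has the higher raw log-likelihood,
# but the long answer is more likely per byte.
choices = ["yes", "it depends on the weather"]
lls = [-4.0, -10.0]

raw_pick = pick_choice(lls, choices)                   # ranks by -4.0 vs -10.0
norm_pick = pick_choice(lls, choices, normalize=True)  # ranks by -4/3 vs -10/25
```

Under this convention, raw scoring favours short choices because every extra token lowers the summed log-likelihood, which is one reason the two columns can move in opposite directions on the same task.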
## Citation

If you use this model, please cite:

```bibtex
@misc{byrne86mbase,
  title        = {Byrne-86M-Base: A ~86M-parameter SpikeWhaleLM},
  author       = {Dean Byrne (Quazim0t0)},
  year         = {2026},
  howpublished = {HuggingFace, \url{https://huggingface.co/Quazim0t0/Byrne-86M-Base}},
  note         = {Quazim0t0/Byrne-86M-Base}
}
```
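
For readers who want a concrete picture of the greedy longest-match scheme described in the Tokenizer section, here is a minimal sketch. It is not the code in `spike_tokenizer.py`: the function names, the toy vocabulary, and the choice to treat `<|im_start|>` as an ordinary vocab key are all assumptions made for illustration. The real tokenizer has 16,512 entries and may handle special tokens differently.

```python
def build_toy_vocab():
    """Hypothetical toy vocab: all 256 single bytes plus a few longer keys."""
    vocab = {chr(b): b for b in range(256)}  # latin-1 char -> id
    for piece in ["the", "th", " cat", "<|im_start|>"]:
        vocab[piece] = len(vocab)
    return vocab


def encode(text, vocab):
    """Greedy longest-match over the latin-1 view of the UTF-8 bytes."""
    s = text.encode("utf-8").decode("latin-1")  # one character per byte
    max_len = max(len(key) for key in vocab)
    ids, i = [], 0
    while i < len(s):
        # Try the longest candidate first, then shrink until a key matches.
        for length in range(min(max_len, len(s) - i), 0, -1):
            piece = s[i:i + length]
            if piece in vocab:
                ids.append(vocab[piece])
                i += length
                break
        else:
            raise KeyError(f"no vocab entry covers byte at position {i}")
    return ids


def decode(ids, vocab):
    """Invert encode: join the pieces, then map latin-1 chars back to bytes."""
    inverse = {idx: piece for piece, idx in vocab.items()}
    s = "".join(inverse[idx] for idx in ids)
    return s.encode("latin-1").decode("utf-8", errors="replace")


vocab = build_toy_vocab()
ids_plain = encode("the cat", vocab)   # "the" + " cat"
ids_accent = encode("thé", vocab)      # "th" + the two UTF-8 bytes of "é"
round_trip = decode(ids_accent, vocab)
```

Because every single byte is in the toy vocab, any input can be encoded; longer keys only shorten the resulting id sequence.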