Brújula-150M-32K-chat

A ~150M-parameter (157M trainable, tied embeddings) DeepSeek-style decoder that runs a 32,768-token context window on a single consumer GPU, fine-tuned for chat and long-context retrieval. I'm not aware of a smaller model with a working 32K context: comparably tiny models like SmolLM2-135M cap out at 8K, and the smallest mainstream 32K model (Qwen2.5-0.5B) is ~3× larger. (I haven't audited every model on the Hub, so this is not a claim to an absolute record, just that no smaller one is known to me.) It was built and trained end-to-end by one hobbyist on a single Intel Arc B580 (12 GB).

This is v1.0. It is good enough to be useful and, more importantly, small and cheap enough to fine-tune further yourself; that is the point of shipping it.

Honesty first: this is a 150M hobby model, not a Phi/SmolLM competitor. It chats, follows simple instructions, and retrieves facts from very long inputs, but it has the factual ceiling you'd expect at this size (see Limitations). Compare it to other consumer-GPU hobby projects, not to lab models.

What's interesting about it

  • 32K context at 150M. Models this small usually cap out at 2–8K (SmolLM2-135M: 8K; Pythia-160M: 2K), and the smallest mainstream models that actually do 32K are ~0.5B and up (Qwen2.5-0.5B). This one reads and retrieves across 32K tokens.
  • Architecture. Multi-head Latent Attention (MLA, DeepSeek-V2-style low-rank KV/Q, which keeps KV cheap at long context) + RoPE + SquaredReLU FFN + tied embeddings, with v3 Block Attention-Residuals (softmax attention over windowed block-sums of earlier layers instead of a plain residual sum; arXiv:2603.15031).
  • 32K via YaRN, baked in. Pre-trained at 1024 tokens, context-extended training-free with a YaRN (NTK-by-parts) RoPE warp, then fine-tuned at 16K with answer-masked SFT. The scaling ships inside the config: no manual RoPE surgery, the 32K behaviour is just there on load.
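As a rough illustration of why a low-rank KV matters at long context, the sketch below compares hypothetical cache sizes for one 32K sequence. It takes the 816 hidden size, 18 layers and kv=64 from the Training section and assumes that kv=64 is the per-layer latent width a cache would store. It ignores any separate RoPE key dimensions, and the reference implementation ships without a KV cache, so these are back-of-envelope numbers, not measurements.

```python
# Hypothetical cache-size arithmetic; nothing here is measured on the model.
HIDDEN = 816       # hidden size, from the Training section
LAYERS = 18        # layer count, from the Training section
KV_LATENT = 64     # assumption: "kv=64" is the cached latent width per layer
BYTES = 2          # bytes per value in bfloat16


def cache_bytes(tokens: int, values_per_token_per_layer: int) -> int:
    """Bytes needed to cache `tokens` positions across all layers."""
    return tokens * LAYERS * values_per_token_per_layer * BYTES


def full_kv_bytes(tokens: int) -> int:
    # Standard attention caches one key and one value of width HIDDEN per layer.
    return cache_bytes(tokens, 2 * HIDDEN)


def latent_kv_bytes(tokens: int) -> int:
    # MLA-style caching stores a single shared latent per layer instead.
    return cache_bytes(tokens, KV_LATENT)


if __name__ == "__main__":
    n = 32768
    print(f"full K/V: {full_kv_bytes(n) / 2**20:.0f} MiB")
    print(f"latent  : {latent_kv_bytes(n) / 2**20:.0f} MiB")
```

Under these assumptions the latent cache is about 25× smaller than a full key/value cache at the same length.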

Usage

This model uses custom modeling code, so pass trust_remote_code=True:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Sakatepon/Brujula-150M-32K-chat"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, dtype=torch.bfloat16
).to("cuda").eval()   # "xpu" for Intel Arc; "cpu" works too

prompt = "User: Explain why the sky is blue in one paragraph.\nAssistant:"
ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
out = model.generate(
    ids, max_new_tokens=256,
    do_sample=True, temperature=0.8, top_p=0.95,
    repetition_penalty=1.3,          # important: see Limitations
    use_cache=False,                 # this reference impl has no KV cache
    pad_token_id=50256, eos_token_id=50256,
)
print(tok.decode(out[0], skip_special_tokens=True))

Chat format (plain, no special chat template):

User: <your message>
Assistant:

The model emits <|endoftext|> (id 50256) to end a turn; generation stops there.
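A minimal sketch of the format above as two helpers. The function names `build_prompt` and `extract_reply` are hypothetical (they are not part of the repo), and the helpers only cover the single-turn shape described here.

```python
END_OF_TURN = "<|endoftext|>"   # id 50256, as stated above


def build_prompt(user_message: str) -> str:
    """Wrap one user message in the plain User:/Assistant: format."""
    return f"User: {user_message.strip()}\nAssistant:"


def extract_reply(decoded: str, prompt: str) -> str:
    """Return only the assistant's text from a decoded generation."""
    # Drop the echoed prompt when the decoded text starts with it.
    reply = decoded[len(prompt):] if decoded.startswith(prompt) else decoded
    # Cut at the end-of-turn marker in case special tokens were not skipped.
    reply = reply.split(END_OF_TURN, 1)[0]
    return reply.strip()
```

If the decoded text does not round-trip the prompt exactly, slicing the generated token IDs before decoding (for example `out[0, ids.shape[1]:]`) is the more robust way to isolate the reply.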

Long-context (retrieval) use

Put the long document first, then ask at the end:

with open("long_document.txt", encoding="utf-8") as f:
    context = f.read()
prompt = f"{context}\n\nUser: According to the document, who founded the company?\nAssistant:"

It accepts up to 32,768 tokens. Generation has no KV cache (the AttnRes aggregators mix across layers, which makes a per-step cache intricate), so it recomputes the context each step: correct, but slow at long context. Short chats are fast.
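To stay inside the window, one option is to trim the document's token IDs before appending the question. This is a sketch, not something the repo provides: `fit_to_context` is a hypothetical helper, and cutting the tail of the document (rather than the head or the middle) is an arbitrary choice made for illustration.

```python
MAX_CONTEXT = 32768   # context window stated above


def fit_to_context(doc_ids, question_ids, max_new_tokens=64, limit=MAX_CONTEXT):
    """Trim the document (never the question) so prompt + reply fit in `limit`."""
    room = limit - len(question_ids) - max_new_tokens
    if room < 0:
        raise ValueError("question plus reply budget exceeds the context window")
    # Keep the first `room` document tokens, then append the question unchanged.
    return list(doc_ids[:room]) + list(question_ids)


# Possible usage with the tokenizer loaded earlier:
#   doc_ids = tok(context, add_special_tokens=False).input_ids
#   q_ids   = tok("\n\nUser: ...\nAssistant:", add_special_tokens=False).input_ids
#   ids     = torch.tensor([fit_to_context(doc_ids, q_ids)], device=model.device)
```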

Needle-in-a-haystack / passkey retrieval: test it this way, not as chat

This is the model's headline skill, but it is the one most people test wrong and walk away disappointed. Don't probe retrieval with the User:/Assistant: chat format and sampling.

Two reasons it fails that way:

  1. Format. The retrieval skill was trained and measured as a continuation, not a chat turn. The fact is planted as a flat sentence (... IMPORTANT: the X is Y. Please remember it.) and the query is a stem the model completes (Question: what is the X? Answer: the X is ). Wrapping the whole haystack in User: ... Assistant: and asking conversationally is out-of-distribution for the retrieval head: it reads as a chat and free-associates instead of looking the fact up.
  2. Decoding. Retrieval is a lookup, so decode it greedily (do_sample=False). With the chat defaults (temperature=0.8, top_p=0.95) the correct token is usually in the distribution but you won't reliably draw it, so sampling turns a hit into a coin-flip. Also drop repetition_penalty here (it can push the model off the correct token) and keep max_new_tokens small (~16), since you only need the fact echoed back.
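The two decoding regimes described above can be collected as presets. The dictionary names are hypothetical, while the keys are standard `generate` arguments with the values used in this card. The small function illustrates the "coin-flip" point under a simplifying assumption: every answer token is drawn independently with the same probability of being correct.

```python
# Settings gathered from the chat example and the retrieval advice above.
CHAT_DECODING = dict(
    do_sample=True, temperature=0.8, top_p=0.95,
    repetition_penalty=1.3, max_new_tokens=256,
)
RETRIEVAL_DECODING = dict(
    do_sample=False,      # greedy: always take the most likely token
    max_new_tokens=16,    # only the fact needs to be echoed back
)
COMMON = dict(use_cache=False, pad_token_id=50256, eos_token_id=50256)


def p_all_correct(p_token: float, n_tokens: int) -> float:
    """Chance that sampling draws the right token n_tokens times in a row,
    assuming (hypothetically) independent draws with probability p_token each."""
    return p_token ** n_tokens


# Possible usage: model.generate(ids, **RETRIEVAL_DECODING, **COMMON)
```

For example, a six-token secret whose tokens each carry 0.9 probability is sampled intact only about half the time (`p_all_correct(0.9, 6)` is roughly 0.53), whereas greedy decoding returns it every time as long as each correct token is the most likely one.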

The harness below is exactly the one the model was evaluated with. Copy-paste it:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Sakatepon/Brujula-150M-32K-chat"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, dtype=torch.bfloat16
).to("cuda").eval()   # "xpu" for Intel Arc; "cpu" works too

NOISE = " The weather report mentioned clear skies and a gentle breeze across the quiet valley that morning."

def needle_test(label, secret, context_len=16384, depth=0.5):
    """Plant `secret` ~`depth` of the way into a `context_len`-token haystack, then ask for it."""
    nz   = tok(NOISE, add_special_tokens=False).input_ids
    fact = tok(f" IMPORTANT: the {label} is {secret}. Please remember it.", add_special_tokens=False).input_ids
    q    = tok(f" Question: what is the {label}? Answer: the {label} is", add_special_tokens=False).input_ids
    budget = max(0, context_len - len(fact) - len(q) - 2)
    fill   = lambda n: (nz * (n // len(nz) + 1))[:n]
    nb     = int(budget * depth)
    ids    = torch.tensor([fill(nb) + fact + fill(budget - nb) + q], device=model.device)
    out = model.generate(
        ids, max_new_tokens=16,
        do_sample=False,        # GREEDY: retrieval is a lookup, not a creative task
        use_cache=False,        # no KV cache in this reference impl
        pad_token_id=50256, eos_token_id=50256,
    )
    answer = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    print(f"[{label}] depth {depth:.0%} of {context_len:>6}: "
          f"{'✓ FOUND' if secret in answer else '✗ miss'}  -> {answer!r}")

needle_test("vault code",   "MELON-7714",      context_len=16384, depth=0.10)
needle_test("password",     "river-galaxy-92", context_len=16384, depth=0.50)
needle_test("room number",  "B-4417",          context_len=32768, depth=0.70)

What to expect (own harness: single-needle, greedy):

  • ≤ 16K: near-lossless retrieval; it reliably quotes the planted fact at any depth.
  • 32K: functional but degraded (passkey ≈ 79%). It still reaches facts ~30K tokens back, but with multi-fact attribution noise ("lost in the middle": right number, wrong label, e.g. it answers the room number when asked for the password). Single-fact retrieval is the solid use; don't expect crisp 3-way attribution at the far end.
  • A real chat/QA wrapper does work for retrieval if you keep it as a completion (long doc, then ... Answer: and let it continue); just decode greedily. The chat persona is for short turns; the retrieval head wants a continuation.
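When several facts are planted at once, it helps to separate a true miss from the "right value, wrong label" case described above. A minimal scoring sketch follows; `classify` and `summarize` are hypothetical helpers, not part of the evaluation harness.

```python
def classify(answer: str, asked: str, planted: dict) -> str:
    """Label one retrieval answer.

    planted maps label -> secret for every fact in the haystack;
    asked is the label the question was about.
    """
    if planted[asked] in answer:
        return "hit"
    for label, secret in planted.items():
        if label != asked and secret in answer:
            return "misattributed"   # a planted value, but for the wrong label
    return "miss"


def summarize(outcomes: list) -> dict:
    """Fraction of each outcome over a batch of trials."""
    total = len(outcomes)
    if total == 0:
        return {}
    return {k: outcomes.count(k) / total for k in ("hit", "misattributed", "miss")}
```

For instance, with `planted = {"password": "river-galaxy-92", "room number": "B-4417"}`, an answer containing `B-4417` to the password question is counted as misattributed rather than as a plain miss.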

Limitations (please read)

  • Factual ceiling. At 150M it confabulates. On document-QA it reliably copies facts placed in a long context, but its answers to open-ended factual questions are often wrong or invented. On a real research paper it answered ~2–3 of 5 fact questions correctly; on a synthetic article ~3 of 5. It can also slip into continuing a document's voice instead of answering.
  • Repetition. Use repetition_penalty ≈ 1.3. Without it, it loops. This single knob was the biggest lever on chat quality in testing.
  • Speed at 32K. No KV cache → long-context generation is slow. Fine for retrieval/needle queries; not meant for long free-form generation at full context.
  • English only, GPT-2 BPE tokenizer (vocab 50257).
  • 32K quality is strongest for retrieval/needle tasks (find a fact in a haystack), not for long-form reasoning across the whole window.
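To put the speed limitation in numbers, the sketch below counts how many token positions pass through the model during generation when every step re-reads the whole sequence, next to a hypothetical cached decoder. It counts positions only, so it understates the gap wherever per-position cost itself grows with sequence length (as attention does).

```python
def positions_without_cache(prompt_len: int, new_tokens: int) -> int:
    """Token positions processed when each step recomputes the full sequence."""
    # Step i (0-based) processes the prompt plus the i tokens generated so far.
    return sum(prompt_len + i for i in range(new_tokens))


def positions_with_cache(prompt_len: int, new_tokens: int) -> int:
    """Same count for a hypothetical cached decoder:
    the prompt once, then one new position per later step."""
    return prompt_len + max(0, new_tokens - 1)


if __name__ == "__main__":
    for prompt_len, new in [(200, 256), (32000, 16)]:
        a = positions_without_cache(prompt_len, new)
        b = positions_with_cache(prompt_len, new)
        print(f"prompt={prompt_len:>6} new={new:>3}: {a:>8} vs {b:>6} positions")
```

A 16-token needle answer over a 32,000-token prompt costs roughly 16 full passes over the context, which is why short retrieval queries are workable and long free-form generation at full context is not.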

Training

  • Backbone: v3 Block-AttnRes, 816 hidden / 18 layers / 6 heads, MLA (kv=64, q=192), SquaredReLU FFN, tied embeddings. Pre-trained on FineWeb-Edu at 1024 ctx.
  • Context extension: YaRN yarn_aggr (NTK-by-parts ramp β=16 + attention temperature mscale = 0.1·ln(32768/1024)+1), applied training-free, then answer-masked SFT at 16K (loss only on completion tokens) on a passkey/needle retrieval set to restore retrieval at the extended length.
  • Chat: answer-masked SFT on smol-smoltalk, mixed with retrieval windows so the 32K skill survives the chat tuning.
  • Hardware: a single Intel Arc B580 (12 GB). The whole project is a part-time hobby effort.
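The context-extension arithmetic can be sketched as follows. The mscale formula, the 1024 and 32768 lengths and β=16 come from the list above; the ramp's lower bound `ALPHA`, the RoPE base and the rotary dimension are assumptions picked for illustration. This is the generic NTK-by-parts recipe from YaRN, so the repo's yarn_aggr variant may differ in detail.

```python
import math

ORIG_CTX, NEW_CTX = 1024, 32768   # pre-training and extended lengths, from the text
SCALE = NEW_CTX / ORIG_CTX        # 32x extension
BETA = 16.0                       # ramp upper bound, from the text
ALPHA = 1.0                       # assumption: ramp lower bound
ROPE_BASE = 10000.0               # assumption: common RoPE default
ROPE_DIM = 64                     # assumption: rotary dimensions per head


def mscale(scale: float = SCALE) -> float:
    """Attention temperature factor: 0.1 * ln(scale) + 1."""
    return 0.1 * math.log(scale) + 1.0


def yarn_inv_freq(dim=ROPE_DIM, base=ROPE_BASE, scale=SCALE,
                  alpha=ALPHA, beta=BETA, orig_ctx=ORIG_CTX):
    """NTK-by-parts: interpolate slow dimensions, leave fast ones untouched."""
    out = []
    for i in range(0, dim, 2):
        theta = base ** (-i / dim)
        # Full rotations this dimension completes within the original window.
        rotations = orig_ctx * theta / (2 * math.pi)
        # gamma = 1 keeps the frequency, gamma = 0 divides it by `scale`.
        gamma = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        out.append((1 - gamma) * theta / scale + gamma * theta)
    return out


if __name__ == "__main__":
    print(f"scale = {SCALE:.0f}, mscale = {mscale():.4f}")
```

With a 32× extension the temperature factor works out to about 1.35. Under the assumed base and dimension, the fastest rotary pair is left unchanged and the slowest is slowed by the full factor of 32.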

Reported numbers (own harness; not directly comparable to public leaderboards): the pretrained backbone scores ≈ 19.5 val perplexity (FineWeb-Edu) / 32.7 on WikiText-103, roughly GPT-2-small territory on this harness (GPT-2 small ≈ 29.3 WikiText here). Chat+retrieval SFT validation perplexity is 4.91 on the SFT mixture (a different distribution, not an LM benchmark, so don't compare it to the backbone numbers).

Want to make it better?

That's the idea. It's small enough to fine-tune on a single GPU. Continue SFT on your own chat or domain data, or do a document-QA round to push past the confabulation ceiling. The full training / eval / context-extension code (the faro/Brújula stack) is a separate project; the modeling code in this repo (modeling_brujula_v2.py, configuration_brujula_v2.py) is self-contained and matches that trainer numerically (logits parity-checked to < 1e-5 at build).

License

Apache-2.0.
