
Qovaryx 350M - Scratch Base (random-init)

Compact AI is not small AI. A 350M-parameter trainable substrate engineered to punch above its weight class on a single consumer GPU. Random-init: bring your own corpus and train it from scratch. MTP-K=4, GQA, pluggable FFN backends (dense SwiGLU / ternary BitNet-style / sparse low-rank MoE), optional task-specific heads. Apache-2.0.

Compact ≠ small

Frontier-scale models cost a small country's GPU budget to train and a data-center to serve. Most real applications don't need 70B params; they need a focused 1B that does one thing extraordinarily well, fits in 16 GB of consumer VRAM, and stays on the hobbyist's or researcher's local hardware: no API key, no inference bill, no token-rate limit, no provider drift.

The Qovaryx family is built around that thesis. Same component library at 50M / 350M / 1B sizes, all engineered to:

  • Train on a single consumer GPU (Ada- or Blackwell-class: RTX 4080 / 4090 / 5070 Ti / 5080 / 5090). 50M fits in <1 GB; 350M fits on 12 GB at batch=1 with gradient accumulation; 1B fits in 16 GB with bf16 + adamw_8bit.
  • Run inference on local hardware with no provider lock-in. A serious workstation runs the 1B at usable throughput; the 350M runs on a laptop.
  • Pack modern components into the smaller footprint: Multi-Token Prediction, GQA, ternary and sparse-MoE FFN backends, optional task heads. The architectural choices that make 70B models work also make 1B models punch above their weight class.

This repo is the random-init starting point for that research program. No pretraining has occurred, so the model emits noise out of the box. It exists so you can train the architecture on your own tokens, your own task, your own budget, without paying the wall-clock cost of recreating the scaffold.

Think of it as a trainable substrate, like nanoGPT or the Pythia step-0 branches, but with a few modern components pre-wired:

  • Multi-Token Prediction (MTP-K=4) heads for jointly predicting up to 4 tokens ahead
  • Grouped-Query Attention (GQA) with configurable n_head / n_kv_head ratio (default 16:4)
  • Pluggable FFN backends: dense SwiGLU, ternary SwiGLU (BitNet-style with straight-through estimator), low-rank SwiGLU, routed low-rank MoE (4 experts top-1)
  • Optional task heads: 4-class decision head and raw-pixel chart-patch encoder (vision prefix tokens), both switchable via config
  • Custom 20,242-vocab BPE tokenizer: domain-leaning but broadly reusable
  • Packed mmap shard format for fast training on cheap consumer GPUs (one-time PackOnce compile, then mmap reads instead of per-row BPE)
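
To make the MTP-K=4 item above concrete, here is a minimal sketch of how look-ahead targets are commonly aligned: head k at position t is asked to predict the token at t + k. The function name, the use of -100 as the ignore label, and the list-based layout are assumptions for illustration; the actual head wiring in modeling_qovaryx.py is not shown in this card and may differ.

```python
IGNORE_INDEX = -100  # assumed "skip this position" label, as used by common CE losses


def mtp_targets(tokens, k_max=4, ignore_index=IGNORE_INDEX):
    """Return one target list per look-ahead offset k = 1..k_max.

    Head k at position t predicts tokens[t + k]. Positions whose target
    would fall past the end of the sequence are filled with ignore_index.
    """
    targets = []
    for k in range(1, k_max + 1):
        # Shift left by k, then pad back to the original length.
        shifted = tokens[k:] + [ignore_index] * min(k, len(tokens))
        targets.append(shifted)
    return targets


tokens = [11, 12, 13, 14, 15, 16]
for k, row in enumerate(mtp_targets(tokens), start=1):
    print(f"k={k}: {row}")
```

Each row has the same length as the input, so the per-head losses can be computed position by position and then combined with a weight such as the mtp_weight used in the recipes further down.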

Reference setup: a single RTX 5070 Ti (16 GB, Blackwell sm_120) with PyTorch 2.7 + flash-attn 2.7.4 + bitsandbytes 0.49.2 (adamw_8bit). The 8-bit optimizer, bf16, and a length curriculum together are what let a 50M-param sibling fit in <1 GB and a 1B sibling fit in 16 GB at batch=1.


Why build on Qovaryx?

Compact AI is not small AI. Frontier-scale models ask, "How do we build the biggest intelligence possible?" Qovaryx asks the inverse: how much disciplined intelligence can we extract per parameter, per watt, per GPU?

The published *-scratch-base checkpoints are the trainable substrate for that thesis. They are not pre-trained: they are the random-init starting point, engineered so that one person on one consumer GPU can take the architecture all the way to a focused specialist model without renting a data-center.

| Dimension | Frontier closed (GPT-5, Claude, Gemini) | Frontier open (DeepSeek, Llama, Mistral, Qwen) | Qovaryx |
|---|---|---|---|
| Primary philosophy | Maximum general intelligence | Open-weight general foundation | Behavioral compression + corrective intelligence |
| Infrastructure | Multi-datacenter clusters | Multi-GPU enterprise / cloud | ✅ Single consumer GPU (RTX 4080 / 4090 / 5070 Ti / 5080 / 5090) |
| Deployment | Cloud / API only | Cloud or local (≥1× A100-class at the larger sizes) | ✅ Local-first, fits in 16 GB VRAM at every size |
| Cost model | Very high compute + ongoing API spend | Moderate-high compute, lower at inference | ✅ Consumer-grade: power bill + GPU you already own |
| License | Closed weights, ToS-gated | Open weights (license varies) | ✅ Apache-2.0 weights + Apache-2.0 reference trainer |
| Behavioral control | Mostly emergent / safety-layer | Fine-tune dependent | ✅ Deterministic shell + crystal governance: explicit, not emergent |
| Specialization strategy | One giant universal model | General foundation, fine-tune downstream | ✅ Modular specialists composed via the same compact base |
| Confidence handling | Opaque token probabilities | Token probabilities | ✅ Calibrated 4-class decision head (action-gate-style classifier, optional) |
| Multi-token prediction | Generally next-token only | Generally next-token only | ✅ MTP-K=4 built in (4-tokens-ahead joint head) |
| FFN options | Dense | Dense or MoE (frontier sizes) | ✅ Pluggable via config flag: dense SwiGLU / ternary BitNet-style / sparse low-rank MoE |
| Attention | MHA / GQA | GQA | ✅ GQA with configurable n_head:n_kv_head ratio |
| Training tokenizer | Provider-controlled | Provider-controlled | ✅ You bundle it (20,242-vocab BPE shipped; replaceable) |
| Vision input | Provider plugin | Provider plugin | ✅ Optional raw-pixel chart-patch encoder, switchable per-row at train time |

✅ = something Qovaryx provides out of the box on the scratch-base release.

This is not a claim that Qovaryx beats GPT-5 on MMLU. It will not. It is a claim that the right shape of small can do real work where the right shape of huge is unavailable, unaffordable, or unowned.

Why this base helps you build

  • The components are already wired: MTP-K, GQA, decision head, ternary/MoE FFN backends, chart patch encoder. All are switchable via config. Skip three months of architecture work.
  • It fits: 50M fits anywhere; 350M fits on a 12 GB card; 1B fits on a 16 GB consumer card with adamw_8bit + bf16. You can actually train these on hardware you can actually buy.
  • It's honest about what's withheld: the architecture is open. The crystallization recipes, eval gold, verifier internals, and shell logic stay private. You build on Qovaryx's substrate; we don't pretend you're getting the whole stack.
  • Apache-2.0: research, hobby, commercial. Attribution appreciated, not legally required.

Qovaryx is NOT trying to be

  • A frontier-IQ replacement
  • A benchmark champion on broad evals
  • A chat product
  • A substitute for engineering on the wrapper / verifier / shell; those are where compact AI earns its keep

Sizes in this family - consumer-GPU first

| Repo | Params | d_model | n_layer | n_head | n_kv_head | d_ff | VRAM @ training (bf16, adamw_8bit) | VRAM @ inference (bf16) |
|---|---|---|---|---|---|---|---|---|
| tjarvis91/qovaryx-50m-scratch-base | ~47M | 512 | 12 | 8 | 2 | 1408 | <1 GB | <0.5 GB |
| tjarvis91/qovaryx-350m-scratch-base | ~352M | 1024 | 24 | 16 | 4 | 2816 | ~3 GB | ~1.5 GB |
| tjarvis91/qovaryx-1b-scratch-base | ~1.05B | 2048 | 22 | 16 | 4 | 5504 | ~12 GB | ~3 GB |

All three share the same component library and tokenizer, so pick the size your GPU can hold. You do not need an A100 to train these. A 16 GB consumer card handles every size in this family. A 12 GB card handles 50M and 350M comfortably. A 24 GB card lets you push 1B with larger batches.
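
As a rough cross-check on the VRAM columns, the sketch below computes a memory floor from the parameter counts in the sizes table. The byte budget is an assumption, not something this card specifies: 2 bytes per parameter for bf16 weights, 2 for bf16 gradients, and 2 for the two 1-byte moment buffers of an 8-bit Adam optimizer, with no fp32 master copy. Activations, attention workspace, and CUDA context are excluded, so real usage will sit above these numbers.

```python
def weights_gb(n_params, weight_bytes=2):
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * weight_bytes / 1e9


def min_training_gb(n_params, weight_bytes=2, grad_bytes=2, optim_bytes=2):
    """Lower bound on training memory in GB: weights + grads + optimizer state.

    Assumed byte budget (see lead-in); activations are NOT included.
    """
    return n_params * (weight_bytes + grad_bytes + optim_bytes) / 1e9


# Approximate parameter counts taken from the sizes table.
sizes = {"50m": 47e6, "350m": 352e6, "1b": 1.05e9}
for name, n in sizes.items():
    print(f"{name}: weights ~{weights_gb(n):.2f} GB, "
          f"training floor ~{min_training_gb(n):.2f} GB")
```

Under these assumptions the floors land below the table's training figures, which is what you would expect once activations at the chosen sequence length are added on top.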


TL;DR - what's in this repo

| File | Purpose |
|---|---|
| config.json | Architecture spec (DecoderConfig): d_model, n_layer, FFN kind, MTP-K, GQA ratio, vocab, max_seq_len |
| pytorch_model.bin | Random-init weights (Glorot/Xavier per layer kind), bf16 |
| tokenizer.json | 20,242-vocab BPE (custom; domain-leaning but general-purpose) |
| tokenizer_config.json | Tokenizer wrapping config |
| generation_config.json | Default sampling params |
| modeling_qovaryx.py | FinanceDecoder class (named for legacy reasons; the class is task-agnostic) + heads + FFN backends |
| train_quickstart.py | A nanoGPT-style 200-line training loop you can run today |
| README.md | This card |

The architecture is custom, so loading requires trust_remote_code=True. Apart from that flag, load it like any other HF model.


Quickstart

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("tjarvis91/qovaryx-350m-scratch-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "tjarvis91/qovaryx-350m-scratch-base",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda()

# Out of the box this generates noise: the model is random-init by design.
# Train it on your own corpus before expecting useful output.
out = model.generate(tok("hello", return_tensors="pt").input_ids.cuda(), max_new_tokens=20)
print(tok.decode(out[0]))

Minimal training loop (single GPU, bf16, AdamW):

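import torch
from torch.utils.data import DataLoader  # build `your_dataloader` with this

# Assumes `model` is loaded as in the Quickstart above and `your_dataloader`
# is your own DataLoader yielding dicts with at least an "input_ids" tensor.
model.train()
opt = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.1, betas=(0.9, 0.95))
for step, batch in enumerate(your_dataloader):
    batch = {k: v.cuda() for k, v in batch.items()}
    # bf16 autocast covers the forward pass; backward runs outside the context.
    with torch.amp.autocast("cuda", dtype=torch.bfloat16):
        out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    opt.step()
    opt.zero_grad(set_to_none=True)
    if step % 10 == 0:
        print(f"step={step} loss={out.loss.item():.4f}")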

A full reference recipe (length curriculum + MTP-K + decision-head + packed shards + adamw_8bit for 16 GB cards) is in train_quickstart.py.
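
A quick sanity check for the first few steps of the loop above: a freshly initialized language model that spreads probability roughly evenly over the vocabulary should report a next-token loss near ln(vocab_size). The sketch below computes that baseline for the 20,242-token tokenizer. Whether this checkpoint's initial logits are actually close to uniform depends on its init scale, which is an assumption here.

```python
import math

VOCAB_SIZE = 20_242  # tokenizer size stated in this card


def uniform_loss(vocab_size):
    """Cross-entropy, in nats, of a uniform guess over vocab_size tokens."""
    return math.log(vocab_size)


print(f"expected initial loss ~ {uniform_loss(VOCAB_SIZE):.2f}")
```

If the first logged loss is far above this value, the usual suspects are a label/shift mismatch or an init scale that makes the logits very peaked; if it is far below, check for label leakage in the batch.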


FFN backends - switchable via config

Set ffn_kind in config.json (or via from_pretrained(..., ffn_kind=...)):

| ffn_kind | Description | When to use |
|---|---|---|
| swiglu | Dense SwiGLU (the obvious baseline) | Default. Fastest wall-clock per step. |
| ternary_swiglu | BitNet-style ternary weights with straight-through estimator | When you care about deployable model size and accept ~3× slower training |
| lowrank_swiglu | Factorized projections (rank ffn_rank) | Param compression without sparsity |
| routed_lowrank_swiglu | Sparse MoE: ffn_experts experts with top-ffn_top_k routing | When you want capacity without dense FLOPs |

These are inspired by published work (BitNet, DeepSeek-V3 MTP, Mixtral, GShard, ST-MoE). The novelty here is that all four share one trainer, one tokenizer, and one packed-shard pipeline, so switching backends is a config edit, not a fork.
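
For readers new to the ternary backend, here is a minimal sketch of the absmean ternarization described in the BitNet b1.58 line of work: scale the weights by their mean absolute value, round, and clip to {-1, 0, +1}. This illustrates the general technique only; the exact scheme, granularity, and naming inside ternary_swiglu are not documented in this card, so treat every name below as hypothetical.

```python
def ternarize(weights, eps=1e-8):
    """Quantize a list of floats to codes in {-1, 0, 1} plus one shared scale.

    Assumed scheme: scale = mean(|w|); code = clip(round(w / scale), -1, 1).
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map ternary codes back to floats for use in the forward pass."""
    return [c * scale for c in codes]


codes, scale = ternarize([0.9, -0.05, 0.4, -1.2])
print(codes, round(scale, 4))
print(dequantize(codes, scale))
```

During training, a straight-through estimator uses the quantized weights in the forward pass while passing gradients to the underlying full-precision weights as if the rounding were the identity, which is one reason this backend trains more slowly than dense SwiGLU.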


Optional task heads

The base architecture exposes two opt-in heads, off by default:

  • decision_head_enabled: 4-class classification head pooled at a chosen token position. Useful for downstream policy / preference / structured-action tasks. Co-trained via masked CE.
  • chart_patch_encoder_enabled: strided-Conv2d raw-pixel encoder that converts an input image into prefix tokens, fed into the causal decoder before the text tokens. Useful for any text+image task; not specific to charts despite the name.

Both can be turned on per-row at training time (the trainer reads per-example metadata), so you can mix unimodal and multimodal rows in the same shard. Both are random-init in this repo and need to be trained alongside the LM head if you use them.
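
The "masked CE" used for the decision head can be sketched as a cross-entropy that averages only over rows that carry a decision label, so rows without one contribute nothing. The function below is a plain-Python illustration under that assumption; the mask source (per-example metadata), the reduction, and the function name are hypothetical rather than taken from the reference trainer.

```python
import math


def masked_cross_entropy(logits_rows, labels, mask):
    """Mean cross-entropy (nats) over the rows where mask is truthy.

    logits_rows: list of per-row logit lists (4 classes for the decision head)
    labels:      list of integer class ids, ignored where mask is falsy
    mask:        list of booleans, True where the row has a decision label
    """
    total, count = 0.0, 0
    for logits, label, keep in zip(logits_rows, labels, mask):
        if not keep:
            continue
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[label]
        count += 1
    return total / count if count else 0.0


loss = masked_cross_entropy(
    [[0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0]],
    labels=[2, 1],
    mask=[True, False],  # second row has no decision label
)
print(f"{loss:.4f}")
```

Returning 0.0 for a batch with no labelled rows keeps the combined loss finite when a batch happens to contain only plain language-modeling examples.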


Suggested training recipes

These are starting points; tune to your data. A single 5070 Ti / RTX 4080-class GPU is assumed.

50M baseline (LM only)

target_tokens:           500M-2B
tokens_per_batch:        4096
grad_accum_steps:        8
max_seq_len:             2048
length_curriculum:       (512,1000)(1024,3000)(2048,10000)(4096,-1)
lr:                      2e-4
warmup_steps:            500
weight_decay:            0.1
optimizer:               adamw_8bit (bf16)
attn_backend:            flash (FA2 if available, else PyTorch SDPA)
ffn_kind:                swiglu
mtp_weight:              0.3
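
The length_curriculum value in the recipe above can be read as a list of stages. The sketch below assumes each pair is (sequence length, step at which that stage ends) and that -1 means "until training ends"; that interpretation and both helper names are assumptions, since the card does not spell out the format. Note also that the last stage lists 4096 while max_seq_len is 2048 in this recipe, and how the trainer reconciles the two is not stated here.

```python
import re


def parse_curriculum(spec):
    """Parse '(512,1000)(1024,3000)...' into a list of (seq_len, until_step)."""
    return [(int(a), int(b)) for a, b in re.findall(r"\((\d+),\s*(-?\d+)\)", spec)]


def seq_len_at(step, stages):
    """Return the sequence length in effect at a given optimizer step.

    Assumed semantics: a stage applies while step < until_step;
    until_step == -1 marks the final, open-ended stage.
    """
    for seq_len, until_step in stages:
        if until_step == -1 or step < until_step:
            return seq_len
    return stages[-1][0]  # past every boundary: stay on the last stage


stages = parse_curriculum("(512,1000)(1024,3000)(2048,10000)(4096,-1)")
for step in (0, 999, 1000, 2999, 3000, 10000):
    print(step, seq_len_at(step, stages))
```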

350M with MTP + decision head

target_tokens:           5B-20B
tokens_per_batch:        8192
grad_accum_steps:        16
max_seq_len:             4096
ffn_kind:                ternary_swiglu  (or swiglu)
mtp_weight:              0.3
decision_weight:         0.5
class_weighted_decision: true
calibration_loss_weight: 0.2  (if you want a confidence-calibrated head)
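
To turn a token budget into a step count, the sketch below multiplies tokens_per_batch by grad_accum_steps to get tokens per optimizer step, then divides the target by that. It assumes tokens_per_batch is the size of one micro-batch (before accumulation); if the reference trainer defines it as the post-accumulation total, drop the multiplication. The function name is hypothetical.

```python
def optimizer_steps(target_tokens, tokens_per_batch, grad_accum_steps):
    """Optimizer steps needed to consume target_tokens (rounded up).

    Assumes tokens_per_batch counts one micro-batch, so each optimizer
    step sees tokens_per_batch * grad_accum_steps tokens.
    """
    tokens_per_step = tokens_per_batch * grad_accum_steps
    return -(-target_tokens // tokens_per_step)  # integer ceiling division


# 350M recipe: 8192 tokens per batch, 16 accumulation steps.
for target in (5_000_000_000, 20_000_000_000):
    print(target, optimizer_steps(target, 8192, 16))
```

Multiplying the step count by your measured seconds per step gives a wall-clock estimate before you commit a GPU to the run.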

1B with sparse MoE

target_tokens:           50B-200B
ffn_kind:                routed_lowrank_swiglu
ffn_rank:                128
ffn_experts:             4
ffn_top_k:               1
mixed_precision:         bf16
optimizer:               adamw_8bit
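
With ffn_experts: 4 and ffn_top_k: 1, each token is sent to exactly one expert. A minimal sketch of top-1 routing follows: take the argmax of each token's router scores, then count how many tokens each expert received. The router itself, any load-balancing loss, and capacity limits are left out, and the function names are hypothetical rather than the ones in modeling_qovaryx.py.

```python
def route_top1(router_logits):
    """For each token's list of expert scores, return the best expert's index."""
    return [max(range(len(row)), key=row.__getitem__) for row in router_logits]


def expert_load(assignments, n_experts):
    """Count how many tokens were assigned to each expert."""
    load = [0] * n_experts
    for expert in assignments:
        load[expert] += 1
    return load


router_logits = [
    [0.1, 2.0, -1.0, 0.3],  # token 0
    [1.5, 0.2, 0.1, 0.0],   # token 1
    [0.0, 0.1, 0.2, 3.0],   # token 2
]
assignments = route_top1(router_logits)
print(assignments, expert_load(assignments, n_experts=4))
```

Watching the per-expert load during training is a cheap way to spot routing collapse, where one expert ends up receiving nearly every token.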

What this is NOT

  • Not a pretrained model. Out-of-the-box outputs are noise. Random initialization is the entire point.
  • Not finance-specific despite the legacy class name FinanceDecoder. The architecture is task-agnostic; the BPE tokenizer leans toward finance-aware merges but works on any English text.
  • Not a drop-in replacement for Llama / Qwen / Mistral. The component set is different (MTP-K heads in particular need their own training term).
  • Not adversarially robust. It's a substrate.
  • Not a tiny / toy model. 1B params at bf16 hits 2 GB on disk; trained well, it competes seriously on focused tasks. "Compact" means efficient, not weak.

License

Apache-2.0. Use it for research, commercial work, hobby projects, whatever. Attribution appreciated but not legally required.


Research notes

Qovaryx is part of a broader local-sovereign-AI research program. Higher-level framings, architectural rationale, and ablation studies are published progressively at:

Research index: https://github.com/thron-j/qovaryx-ai-research

Implementation details, training corpora, and certain ablation specifics are intentionally withheld from the public devlog. The framings are publishable; the internals are not. Collaboration inquiries: jeherizonllc@gmail.com.


Support

If this base helps you build something, support continued development:

☕ ko-fi.com/tjarvis91

Every contribution funds GPU time and the next-generation Qovaryx training runs.


Citation

@misc{qovaryx-scratch-base-2026,
  title     = {Qovaryx: A Compact Decoder Architecture with Multi-Token Prediction, GQA, and Pluggable FFN Backends},
  author    = {Jarvis, Thomas},
  year      = {2026},
  month     = {May},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/tjarvis91/qovaryx-350m-scratch-base}
}

Status

Random-init checkpoint as of 2026-05-22. Future updates will add trained sibling repos with downstream task heads enabled (decision head + chart-patch encoder variants). Watch the org page for new releases.
