Libraries

How to use tjarvis91/qovaryx-350m-scratch-base with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="tjarvis91/qovaryx-350m-scratch-base", trust_remote_code=True)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("tjarvis91/qovaryx-350m-scratch-base", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use tjarvis91/qovaryx-350m-scratch-base with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "tjarvis91/qovaryx-350m-scratch-base"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tjarvis91/qovaryx-350m-scratch-base",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/tjarvis91/qovaryx-350m-scratch-base

SGLang

How to use tjarvis91/qovaryx-350m-scratch-base with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "tjarvis91/qovaryx-350m-scratch-base" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tjarvis91/qovaryx-350m-scratch-base",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "tjarvis91/qovaryx-350m-scratch-base" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tjarvis91/qovaryx-350m-scratch-base",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Qovaryx 350M — Scratch Base (random-init)

Compact AI is not small AI. A 350M-parameter trainable substrate engineered to punch above its weight class on a single consumer GPU. Random-init — bring your own corpus, train it from scratch. MTP-K=4, GQA, pluggable FFN backends (dense SwiGLU / ternary BitNet-style / sparse low-rank MoE), optional task-specific heads. Apache-2.0.

Compact ≠ small

Frontier-scale models cost a small country's GPU budget to train and a data-center to serve. Most real applications don't need 70B params; they need a focused 1B that does one thing extraordinarily well, fits in 16 GB of consumer VRAM, and stays on the hobbyist/researcher's local hardware — no API key, no inference bill, no token-rate limit, no provider drift.

The Qovaryx family is built around that thesis. Same component library at 50M / 350M / 1B sizes, all engineered to:

Train on a single consumer GPU, Ada Lovelace or Blackwell class (RTX 4080 / 4090 / 5070 Ti / 5080 / 5090). 50M fits in <1 GB; 350M fits on 12 GB at batch=1 with gradient accumulation; 1B fits in 16 GB with bf16 + adamw_8bit.
Run inference on local hardware with no provider lock-in. A serious workstation runs the 1B at usable throughput; the 350M runs on a laptop.
Pack modern components into the smaller footprint: Multi-Token Prediction, GQA, ternary and sparse-MoE FFN backends, optional task heads. The architectural choices that make 70B models work also make 1B models punch above their weight class.

This repo is the random-init starting point for that research program. No pretraining has occurred — the model emits noise out of the box. It exists so you can train the architecture on your own tokens, your own task, your own budget, without paying the wall-clock cost of recreating the scaffold.

Think of it as a trainable substrate — like nanoGPT or the Pythia step-0 branches — but with a few modern components pre-wired:

Multi-Token Prediction (MTP-K=4) heads for jointly predicting up to 4 tokens ahead
Grouped-Query Attention (GQA) with configurable n_head / n_kv_head ratio (default 16:4)
Pluggable FFN backends: dense SwiGLU, ternary SwiGLU (BitNet-style with straight-through estimator), low-rank SwiGLU, routed low-rank MoE (4 experts top-1)
Optional task heads: 4-class decision head, raw-pixel chart-patch encoder (vision prefix tokens) — switchable via config
Custom 20,242-vocab BPE tokenizer — domain-leaning but broadly reusable
Packed mmap shard format for fast training on cheap consumer GPUs (one-time PackOnce compile, then mmap reads instead of per-row BPE)
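
As a rough illustration of the GQA item in the list above, the sketch below shares each key/value head across a group of query heads at the default 16:4 ratio. It is not the code in modeling_qovaryx.py: the function name `gqa_attention` and the tensor layout are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v):
    """Causal grouped-query attention (illustrative, not the repo's code).

    q:    (batch, n_head, seq, head_dim)
    k, v: (batch, n_kv_head, seq, head_dim), with n_head % n_kv_head == 0
    """
    n_head, n_kv_head = q.shape[1], k.shape[1]
    assert n_head % n_kv_head == 0, "n_head must be a multiple of n_kv_head"
    group = n_head // n_kv_head
    # Each K/V head serves `group` consecutive query heads.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

torch.manual_seed(0)
batch, seq, head_dim = 2, 8, 64
q = torch.randn(batch, 16, seq, head_dim)   # n_head = 16
k = torch.randn(batch, 4, seq, head_dim)    # n_kv_head = 4
v = torch.randn(batch, 4, seq, head_dim)
out = gqa_attention(q, k, v)
print(out.shape)  # torch.Size([2, 16, 8, 64])
```

With 16 query heads and 4 K/V heads, the K/V projections and the K/V cache are a quarter of the multi-head size, while the output keeps one slice per query head.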

The reference trainer was developed and run on a single RTX 5070 Ti (16 GB, Blackwell sm_120) with PyTorch 2.7, flash-attn 2.7.4, and bitsandbytes 0.49.2 (adamw_8bit). With the 8-bit optimizer, bf16, and a length curriculum, a 50M-param sibling fits in <1 GB and a 1B sibling fits in 16 GB at batch=1.

Why build on Qovaryx?

Compact AI is not small AI. Frontier-scale models ask how do we build the biggest intelligence possible? Qovaryx asks the inverse: how much disciplined intelligence can we extract per parameter, per watt, per GPU?

The published *-scratch-base checkpoints are the trainable substrate for that thesis. They are not pre-trained — they are the random-init starting point, engineered so that one person on one consumer GPU can take the architecture all the way to a focused specialist model without renting a data-center.

Dimension	Frontier closed (GPT-5, Claude, Gemini)	Frontier open (DeepSeek, Llama, Mistral, Qwen)	Qovaryx
Primary philosophy	Maximum general intelligence	Open-weight general foundation	Behavioral compression + corrective intelligence
Infrastructure	Multi-datacenter clusters	Multi-GPU enterprise / cloud	✅ Single consumer GPU (RTX 4080 / 4090 / 5070 Ti / 5080 / 5090)
Deployment	Cloud / API only	Cloud or local (≥1× A100-class at the larger sizes)	✅ Local-first, fits in 16 GB VRAM at every size
Cost model	Very high compute + ongoing API spend	Moderate-high compute, lower at inference	✅ Consumer-grade — power bill + GPU you already own
License	Closed weights, ToS-gated	Open weights (license varies)	✅ Apache-2.0 weights + Apache-2.0 reference trainer
Behavioral control	Mostly emergent / safety-layer	Fine-tune dependent	✅ Deterministic shell + crystal governance — explicit, not emergent
Specialization strategy	One giant universal model	General foundation, fine-tune downstream	✅ Modular specialists composed via the same compact base
Confidence handling	Opaque token probabilities	Token probabilities	✅ Calibrated 4-class decision head (action-gate-style classifier, optional)
Multi-token prediction	Generally next-token only	Generally next-token only	✅ MTP-K=4 built in (4-tokens-ahead joint head)
FFN options	Dense	Dense or MoE (frontier sizes)	✅ Pluggable: dense SwiGLU / ternary BitNet-style / sparse low-rank MoE — config flag
Attention	MHA / GQA	GQA	✅ GQA with configurable n_head:n_kv_head ratio
Training tokenizer	Provider-controlled	Provider-controlled	✅ You bundle it (20,242-vocab BPE shipped; replaceable)
Vision input	Provider plugin	Provider plugin	✅ Optional raw-pixel chart-patch encoder — switchable per-row at train time

✅ = something Qovaryx provides out of the box on the scratch-base release.

This is not a claim that Qovaryx beats GPT-5 on MMLU. It will not. It is a claim that the right shape of small can do real work where the right shape of huge is unavailable, unaffordable, or unowned.

Why this base helps you build

The components are already wired — MTP-K, GQA, decision head, ternary/MoE FFN backends, chart patch encoder. Switchable via config. Skip three months of architecture work.
It fits — 50M fits anywhere; 350M fits on a 12 GB card; 1B fits on a 16 GB consumer card with adamw_8bit + bf16. You can actually train these on hardware you can actually buy.
It's honest about what's withheld — the architecture is open. The crystallization recipes, eval gold, verifier internals, and shell logic stay private. You build on Qovaryx's substrate; we don't pretend you're getting the whole stack.
Apache-2.0 — research, hobby, commercial. Attribution appreciated, not legally required.

Qovaryx is NOT trying to be

A frontier-IQ replacement
A benchmark champion on broad evals
A chat product
A substitute for engineering on the wrapper / verifier / shell — those are where compact AI earns its keep

Sizes in this family — consumer-GPU first

Repo	Params	d_model	n_layer	n_head	n_kv_head	d_ff	VRAM @ training (bf16, adamw_8bit)	VRAM @ inference (bf16)
`tjarvis91/qovaryx-50m-scratch-base`	~47M	512	12	8	2	1408	<1 GB	<0.5 GB
`tjarvis91/qovaryx-350m-scratch-base`	~352M	1024	24	16	4	2816	~3 GB	~1.5 GB
`tjarvis91/qovaryx-1b-scratch-base`	~1.05B	2048	22	16	4	5504	~12 GB	~3 GB

All three share the same component library and tokenizer — pick the size your GPU can hold. You do not need an A100 to train these. A 16 GB consumer card handles every size in this family. A 12 GB card handles 50m + 350m comfortably. A 24 GB card lets you push 1B with larger batches.
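
The training figures in the table can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is an estimate under stated assumptions, not a measurement: it assumes 2 bytes per parameter for bf16 weights, 2 bytes for bf16 gradients, and two 1-byte moments for an 8-bit AdamW, and it ignores activations, buffers, and allocator overhead. The helper name `training_state_gb` is hypothetical.

```python
def training_state_gb(n_params, weight_bytes=2, grad_bytes=2, optim_bytes=2):
    """Rough VRAM for weights + grads + optimizer state, excluding activations.

    Defaults assume bf16 weights and grads (2 bytes each) and an 8-bit AdamW
    that keeps two 1-byte moments per parameter. These are assumptions.
    """
    return n_params * (weight_bytes + grad_bytes + optim_bytes) / 1e9

for name, n in [("50m", 47e6), ("350m", 352e6), ("1b", 1.05e9)]:
    print(f"{name}: ~{training_state_gb(n):.2f} GB state, "
          f"~{n * 2 / 1e9:.2f} GB bf16 weights")
```

Under these assumptions the persistent state comes to roughly 0.28 GB, 2.1 GB, and 6.3 GB, which leaves the rest of each training figure in the table for activations at the chosen sequence length.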

TL;DR — what's in this repo

File	Purpose
`config.json`	Architecture spec (`DecoderConfig`) — d_model, n_layer, FFN kind, MTP-K, GQA ratio, vocab, max_seq_len
`pytorch_model.bin`	Random-init weights (Glorot/Xavier per layer kind), bf16
`tokenizer.json`	20,242-vocab BPE (custom; domain-leaning but general-purpose)
`tokenizer_config.json`	Tokenizer wrapping config
`generation_config.json`	Default sampling params
`modeling_qovaryx.py`	`FinanceDecoder` class (named for legacy reasons; the class is task-agnostic) + heads + FFN backends
`train_quickstart.py`	A nanoGPT-style 200-line training loop you can run today
`README.md`	This card

Because the architecture is custom, loading requires trust_remote_code=True. Otherwise it loads like any other HF model.

Quickstart

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("tjarvis91/qovaryx-350m-scratch-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "tjarvis91/qovaryx-350m-scratch-base",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda()

# Out-of-the-box this generates noise — model is random-init by design.
# Train it on your own corpus, then it will be useful.
out = model.generate(tok("hello", return_tensors="pt").input_ids.cuda(), max_new_tokens=20)
print(tok.decode(out[0]))

Minimal training loop (single GPU, bf16, AdamW):

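import torch

# `model` is the object loaded in the quickstart above; `your_dataloader` is a
# placeholder for any iterable yielding dicts of tensors with an "input_ids" key.
model.train()
opt = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.1, betas=(0.9, 0.95))
for step, batch in enumerate(your_dataloader):
    batch = {k: v.cuda() for k, v in batch.items()}
    # Forward pass under bf16 autocast; the labels reuse input_ids.
    with torch.amp.autocast("cuda", dtype=torch.bfloat16):
        out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()
    # Clip the global gradient norm to 1.0 before the optimizer update.
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    opt.step()
    opt.zero_grad(set_to_none=True)
    if step % 10 == 0:
        print(f"step={step} loss={out.loss.item():.4f}")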

A full reference recipe (length curriculum + MTP-K + decision-head + packed shards + adamw_8bit for 16 GB cards) is in train_quickstart.py.
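
For orientation, one way an MTP-K training term could be written is sketched below. This is an assumption-laden simplification, not the logic in train_quickstart.py: it treats the K heads as independent classifiers over the same hidden states, averages the auxiliary losses, and scales them by a `mtp_weight` argument. The function name `mtp_loss` and its signature are hypothetical.

```python
import torch
import torch.nn.functional as F

def mtp_loss(head_logits, input_ids, mtp_weight=0.3):
    """Combine next-token loss with auxiliary losses for tokens further ahead.

    head_logits: list of K tensors, each (batch, seq, vocab); head k (0-based)
                 predicts the token k + 1 positions ahead.
    input_ids:   (batch, seq) token ids.
    """
    losses = []
    for k, logits in enumerate(head_logits):
        shift = k + 1
        pred = logits[:, :-shift]              # positions that have a target
        target = input_ids[:, shift:]
        losses.append(F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                                      target.reshape(-1)))
    main, extra = losses[0], losses[1:]
    # Average the auxiliary heads so the weight does not grow with K.
    aux = torch.stack(extra).mean() if extra else main.new_zeros(())
    return main + mtp_weight * aux

torch.manual_seed(0)
vocab, K = 50, 4
ids = torch.randint(0, vocab, (2, 16))
logits = [torch.randn(2, 16, vocab) for _ in range(K)]
loss = mtp_loss(logits, ids)
print(loss.item())
```

With a single head the function reduces to ordinary next-token cross-entropy, so the auxiliary term can be dropped without changing the main objective.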

FFN backends — switchable via config

Set ffn_kind in config.json (or via from_pretrained(..., ffn_kind=...)):

`ffn_kind`	Description	When to use
`swiglu`	Dense SwiGLU (the obvious baseline)	Default. Fastest wall-clock per step.
`ternary_swiglu`	BitNet-style ternary weights with straight-through estimator	When you care about deployable model size and accept ~3× slower training
`lowrank_swiglu`	Factorized projections (rank `ffn_rank`)	Param compression without sparsity
`routed_lowrank_swiglu`	Sparse MoE: `ffn_experts` top-`ffn_top_k` routing	When you want capacity without dense FLOPs

These are inspired by published work (BitNet, DeepSeek-V3 MTP, Mixtral, GShard, ST-MoE). The novelty here is that all four share one trainer, one tokenizer, and one packed-shard pipeline — so switching backends is a config edit, not a fork.
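
To make the "ternary weights with straight-through estimator" row concrete, here is a minimal sketch in the spirit of the BitNet b1.58 absmean scheme. The repo's actual `ternary_swiglu` implementation may differ; the function name `ternarize_ste` and the per-tensor scale are assumptions for illustration.

```python
import torch

def ternarize_ste(w, eps=1e-5):
    """Absmean ternary quantization with a straight-through estimator.

    Forward: weights take values in {-scale, 0, +scale}.
    Backward: gradients pass to the latent full-precision `w` unchanged.
    """
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1) * scale
    # (w_q - w).detach() contributes the quantized value in the forward pass
    # but has zero gradient, so d(out)/d(w) is the identity.
    return w + (w_q - w).detach()

torch.manual_seed(0)
w = torch.randn(8, 8, requires_grad=True)
w_t = ternarize_ste(w)
w_t.sum().backward()
print(w.grad.unique())  # every gradient entry is 1
```

The optimizer keeps updating the full-precision latent weights, which is one reason ternary training costs more per step than the dense baseline even though the deployed weights are much smaller.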

Optional task heads

The base architecture exposes two opt-in heads, off by default:

decision_head_enabled — 4-class classification head pooled at a chosen token position. Useful for downstream policy / preference / structured-action tasks. Co-trained via masked CE.
chart_patch_encoder_enabled — strided-Conv2d raw-pixel encoder that converts an input image into prefix tokens, fed into the causal decoder before the text tokens. Useful for any text+image task; not specific to charts despite the name.

Both can be turned on per-row at training time (the trainer reads per-example metadata), so you can mix unimodal and multimodal rows in the same shard. Both are random-init in this repo and need to be trained alongside the LM head if you use them.
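
The prefix-token idea behind chart_patch_encoder_enabled can be sketched with one strided convolution. This is an illustrative stand-in, not the class shipped in modeling_qovaryx.py: the name `PatchPrefixEncoder`, the 16-pixel patch size, and the 3-channel input are all assumptions.

```python
import torch
import torch.nn as nn

class PatchPrefixEncoder(nn.Module):
    """Turn an image into a sequence of prefix embeddings with one strided conv."""
    def __init__(self, d_model, patch=16, in_channels=3):
        super().__init__()
        # kernel_size == stride gives non-overlapping patches.
        self.proj = nn.Conv2d(in_channels, d_model, kernel_size=patch, stride=patch)

    def forward(self, images):                 # (batch, C, H, W)
        x = self.proj(images)                  # (batch, d_model, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)    # (batch, n_patches, d_model)

torch.manual_seed(0)
enc = PatchPrefixEncoder(d_model=1024)
prefix = enc(torch.randn(2, 3, 64, 64))        # 4 x 4 = 16 patches per image
text_emb = torch.randn(2, 10, 1024)            # stand-in for token embeddings
decoder_input = torch.cat([prefix, text_emb], dim=1)
print(decoder_input.shape)  # torch.Size([2, 26, 1024])
```

Because the image tokens are simply concatenated ahead of the text embeddings, rows without an image can skip the encoder and feed text embeddings alone.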

Suggested training recipes

These are starting points — tune to your data. Single 5070 Ti / RTX 4080-class GPU assumed.

50M baseline (LM only)

target_tokens:           500M-2B
tokens_per_batch:        4096
grad_accum_steps:        8
max_seq_len:             2048
length_curriculum:       (512,1000)(1024,3000)(2048,10000)(4096,-1)
lr:                      2e-4
warmup_steps:            500
weight_decay:            0.1
optimizer:               adamw_8bit (bf16)
attn_backend:            flash (FA2 if available, else PyTorch SDPA)
ffn_kind:                swiglu
mtp_weight:              0.3
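
The length_curriculum string above can be read as a list of stages. The sketch below assumes each pair means (seq_len, until_step) and that -1 marks an open-ended final stage; that reading is a guess from the format, and the helper names `parse_curriculum` and `seq_len_at` are hypothetical rather than part of the reference trainer.

```python
import re

def parse_curriculum(spec):
    """Parse '(512,1000)(1024,3000)...' into [(seq_len, until_step), ...]."""
    return [(int(a), int(b)) for a, b in re.findall(r"\((\d+),(-?\d+)\)", spec)]

def seq_len_at(step, stages):
    """Return the sequence length for a step; until_step == -1 means 'rest of run'."""
    for seq_len, until_step in stages:
        if until_step == -1 or step < until_step:
            return seq_len
    return stages[-1][0]

stages = parse_curriculum("(512,1000)(1024,3000)(2048,10000)(4096,-1)")
print([seq_len_at(s, stages) for s in (0, 999, 1000, 2999, 3000, 10000)])
# [512, 512, 1024, 1024, 2048, 4096]
```

Starting with short sequences keeps early steps cheap in both memory and time, since attention and activation cost grow with sequence length.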

350M with MTP + decision head

target_tokens:           5B-20B
tokens_per_batch:        8192
grad_accum_steps:        16
max_seq_len:             4096
ffn_kind:                ternary_swiglu  (or swiglu)
mtp_weight:              0.3
decision_weight:         0.5
class_weighted_decision: true
calibration_loss_weight: 0.2  (if you want a confidence-calibrated head)
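
To see what the 350M recipe implies in optimizer steps, the arithmetic can be sketched as follows. It assumes tokens_per_batch is the size of one micro-batch and that one optimizer step consumes tokens_per_batch × grad_accum_steps tokens; if the trainer defines these keys differently, the numbers change accordingly. The helper `optimizer_steps` is hypothetical.

```python
import math

def optimizer_steps(target_tokens, tokens_per_batch, grad_accum_steps):
    """Return (tokens per optimizer step, steps needed to reach target_tokens)."""
    tokens_per_step = tokens_per_batch * grad_accum_steps
    return tokens_per_step, math.ceil(target_tokens / tokens_per_step)

tokens_per_step, low = optimizer_steps(5_000_000_000, 8192, 16)
_, high = optimizer_steps(20_000_000_000, 8192, 16)
print(tokens_per_step, low, high)  # 131072 38147 152588
```

Under that reading, the 5B to 20B token range corresponds to roughly 38k to 153k optimizer steps, which helps when sizing warmup and checkpoint intervals.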

1B with sparse MoE

target_tokens:           50B-200B
ffn_kind:                routed_lowrank_swiglu
ffn_rank:                128
ffn_experts:             4
ffn_top_k:               1
mixed_precision:         bf16
optimizer:               adamw_8bit
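
The ffn_rank / ffn_experts / ffn_top_k settings above describe a top-1 routed mixture of low-rank SwiGLU experts. The sketch below shows the general shape of such a layer under simplifying assumptions: no load-balancing loss, no capacity limit, and a plain Python loop over experts. The class names are hypothetical and the repo's `routed_lowrank_swiglu` may be organized differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankSwiGLU(nn.Module):
    """SwiGLU whose three projections are factorized through rank `rank`."""
    def __init__(self, d_model, d_ff, rank):
        super().__init__()
        def low_rank(d_in, d_out):
            return nn.Sequential(nn.Linear(d_in, rank, bias=False),
                                 nn.Linear(rank, d_out, bias=False))
        self.gate = low_rank(d_model, d_ff)
        self.up = low_rank(d_model, d_ff)
        self.down = low_rank(d_ff, d_model)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class Top1RoutedFFN(nn.Module):
    """Route each token to one expert, scaled by its router probability."""
    def __init__(self, d_model, d_ff, rank, n_experts):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            [LowRankSwiGLU(d_model, d_ff, rank) for _ in range(n_experts)])

    def forward(self, x):                      # x: (tokens, d_model)
        probs = self.router(x).softmax(dim=-1)
        top_p, top_i = probs.max(dim=-1)       # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_i == e
            if mask.any():
                out[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
        return out

torch.manual_seed(0)
ffn = Top1RoutedFFN(d_model=64, d_ff=176, rank=16, n_experts=4)
y = ffn(torch.randn(10, 64))
print(y.shape)  # torch.Size([10, 64])
```

Each token pays for one expert's FLOPs while the layer stores four experts' parameters, which is the capacity-without-dense-FLOPs trade-off named in the backend table.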

What this is NOT

Not a pretrained model. Out-of-the-box outputs are noise. Random initialization is the entire point.
Not finance-specific despite the legacy class name FinanceDecoder. The architecture is task-agnostic; the BPE tokenizer leans toward finance-aware merges but works on any English text.
Not a drop-in replacement for Llama / Qwen / Mistral. The component set is different (MTP-K heads in particular need their own training term).
Not adversarially robust. It's a substrate.
Not a tiny / toy model. 1B params at bf16 hits 2 GB on disk; trained well, it competes seriously on focused tasks. "Compact" means efficient, not weak.

License

Apache-2.0. Use it for research, commercial work, hobby projects — whatever. Attribution appreciated but not legally required.

Research notes

Qovaryx is part of a broader local-sovereign-AI research program. Higher-level framings, architectural rationale, and ablation studies are published progressively at:

Research index: https://github.com/thron-j/qovaryx-ai-research

Implementation details, training corpora, and certain ablation specifics are intentionally withheld from the public devlog. The framings are publishable; the internals are not. Collaboration inquiries: jeherizonllc@gmail.com.

Support

If this base helps you build something, support continued development:

☕ ko-fi.com/tjarvis91

Every contribution funds GPU time and the next-generation Qovaryx training runs.

Sibling models in this lineage

tjarvis91/qovaryx-50m-scratch-base -- 47M params, 12 layers, fits on any GPU
tjarvis91/qovaryx-350m-scratch-base <- you are here
tjarvis91/qovaryx-1b-scratch-base -- 1.05B params, 22 layers, the full consumer-GPU target
tjarvis91/vfaix-vpa-options-trader -- a separate, trained 9B vision-language model that uses the same training disciplines on Qwen3.5-VL (not the same architecture; shown here for lineage context)

Citation

@misc{qovaryx-scratch-base-2026,
  title     = {Qovaryx: A Compact Decoder Architecture with Multi-Token Prediction, GQA, and Pluggable FFN Backends},
  author    = {Jarvis, Thomas},
  year      = {2026},
  month     = {May},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/tjarvis91/qovaryx-350m-scratch-base}
}

Status

Random-init checkpoint as of 2026-05-22. Future updates will add trained sibling repos with downstream task heads enabled (decision head + chart-patch encoder variants). Watch the org page for new releases.
