Qovaryx 350M: Scratch Base (random-init)
Compact AI is not small AI. A 350M-parameter trainable substrate engineered to punch above its weight class on a single consumer GPU. Random-init: bring your own corpus and train it from scratch. MTP-K=4, GQA, pluggable FFN backends (dense SwiGLU / ternary BitNet-style / sparse low-rank MoE), optional task-specific heads. Apache-2.0.
Compact ≠ small
Frontier-scale models cost a small country's GPU budget to train and a data center to serve. Most real applications don't need 70B params; they need a focused 1B that does one thing extraordinarily well, fits in 16 GB of consumer VRAM, and stays on the hobbyist's or researcher's local hardware: no API key, no inference bill, no token-rate limit, no provider drift.
The Qovaryx family is built around that thesis. Same component library at 50M / 350M / 1B sizes, all engineered to:
- Train on a single consumer GPU of the Ada or Blackwell generation (RTX 4080 / 4090 / 5070 Ti / 5080 / 5090). 50M fits in <1 GB; 350M fits on 12 GB at batch=1 with gradient accumulation; 1B fits in 16 GB with bf16 + `adamw_8bit`.
- Run inference on local hardware, with no provider lock-in. A serious workstation runs the 1B at usable throughput; the 350M runs on a laptop.
- Pack modern components into the smaller footprint: Multi-Token Prediction, GQA, ternary and sparse-MoE FFN backends, optional task heads. The architectural choices that make 70B models work also make 1B models punch above their weight class.
This repo is the random-init starting point for that research program. No pretraining has occurred, so the model emits noise out of the box. It exists so you can train the architecture on your own tokens, your own task, your own budget, without paying the wall-clock cost of recreating the scaffold.
Think of it as a trainable substrate, like nanoGPT or the Pythia step-0 branches, but with a few modern components pre-wired:
- Multi-Token Prediction (MTP-K=4) heads for jointly predicting up to 4 tokens ahead
- Grouped-Query Attention (GQA) with configurable `n_head`/`n_kv_head` ratio (default 16:4)
- Pluggable FFN backends: dense SwiGLU, ternary SwiGLU (BitNet-style with straight-through estimator), low-rank SwiGLU, routed low-rank MoE (4 experts top-1)
- Optional task heads: 4-class decision head, raw-pixel chart-patch encoder (vision prefix tokens), switchable via config
- Custom 20,242-vocab BPE tokenizer, domain-leaning but broadly reusable
- Packed mmap shard format for fast training on cheap consumer GPUs (one-time PackOnce compile, then mmap reads instead of per-row BPE)
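To make the default 16:4 GQA ratio concrete, here is a minimal sketch in plain Python. It assumes the common convention that consecutive query heads share one key/value head; the helper name `kv_head_for` is hypothetical and is not taken from `modeling_qovaryx.py`, whose grouping may differ.

```python
# Illustrative only: maps query heads onto shared key/value heads for GQA.
# `kv_head_for` is a hypothetical helper, not part of the Qovaryx code, and
# contiguous grouping is an assumption.

def kv_head_for(q_head: int, n_head: int = 16, n_kv_head: int = 4) -> int:
    """Return the index of the KV head that query head `q_head` reads from."""
    if n_head % n_kv_head != 0:
        raise ValueError("n_head must be a multiple of n_kv_head")
    group_size = n_head // n_kv_head  # query heads per KV head
    return q_head // group_size

mapping = [kv_head_for(q) for q in range(16)]
print(mapping)

# K and V are stored for n_kv_head heads rather than n_head, so the KV cache
# shrinks by the same ratio relative to full multi-head attention.
kv_cache_fraction = 4 / 16
print(kv_cache_fraction)
```

With the default ratio, each KV head serves four query heads, so the cached keys and values take a quarter of the space that 16 independent KV heads would.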
Trained on a single RTX 5070 Ti (16 GB, Blackwell sm_120) using PyTorch 2.7 + flash-attn 2.7.4 + bnb 0.49.2 (adamw_8bit). 8-bit optimizer + bf16 + length curriculum means a 50M-param sibling fits in <1 GB and a 1B sibling fits in 16 GB at batch=1.
Why build on Qovaryx?
Compact AI is not small AI. Frontier-scale models ask how do we build the biggest intelligence possible? Qovaryx asks the inverse: how much disciplined intelligence can we extract per parameter, per watt, per GPU?
The published *-scratch-base checkpoints are the trainable substrate for that thesis. They are not pre-trained; they are the random-init starting point, engineered so that one person on one consumer GPU can take the architecture all the way to a focused specialist model without renting a data center.
| Dimension | Frontier closed (GPT-5, Claude, Gemini) | Frontier open (DeepSeek, Llama, Mistral, Qwen) | Qovaryx |
|---|---|---|---|
| Primary philosophy | Maximum general intelligence | Open-weight general foundation | Behavioral compression + corrective intelligence |
| Infrastructure | Multi-datacenter clusters | Multi-GPU enterprise / cloud | ✓ Single consumer GPU (RTX 4080 / 4090 / 5070 Ti / 5080 / 5090) |
| Deployment | Cloud / API only | Cloud or local (≥1× A100-class at the larger sizes) | ✓ Local-first, fits in 16 GB VRAM at every size |
| Cost model | Very high compute + ongoing API spend | Moderate-high compute, lower at inference | ✓ Consumer-grade: power bill + GPU you already own |
| License | Closed weights, ToS-gated | Open weights (license varies) | ✓ Apache-2.0 weights + Apache-2.0 reference trainer |
| Behavioral control | Mostly emergent / safety-layer | Fine-tune dependent | ✓ Deterministic shell + crystal governance: explicit, not emergent |
| Specialization strategy | One giant universal model | General foundation, fine-tune downstream | ✓ Modular specialists composed via the same compact base |
| Confidence handling | Opaque token probabilities | Token probabilities | ✓ Calibrated 4-class decision head (action-gate-style classifier, optional) |
| Multi-token prediction | Generally next-token only | Generally next-token only | ✓ MTP-K=4 built in (4-tokens-ahead joint head) |
| FFN options | Dense | Dense or MoE (frontier sizes) | ✓ Pluggable: dense SwiGLU / ternary BitNet-style / sparse low-rank MoE, set by config flag |
| Attention | MHA / GQA | GQA | ✓ GQA with configurable n_head:n_kv_head ratio |
| Training tokenizer | Provider-controlled | Provider-controlled | ✓ You bundle it (20,242-vocab BPE shipped; replaceable) |
| Vision input | Provider plugin | Provider plugin | ✓ Optional raw-pixel chart-patch encoder, switchable per-row at train time |
✓ = something Qovaryx provides out of the box on the scratch-base release.
This is not a claim that Qovaryx beats GPT-5 on MMLU. It will not. It is a claim that the right shape of small can do real work where the right shape of huge is unavailable, unaffordable, or unowned.
Why this base helps you build
- The components are already wired: MTP-K, GQA, decision head, ternary/MoE FFN backends, chart patch encoder. Switchable via config. Skip three months of architecture work.
- It fits: 50M fits anywhere; 350M fits on a 12 GB card; 1B fits on a 16 GB consumer card with `adamw_8bit` + bf16. You can actually train these on hardware you can actually buy.
- It's honest about what's withheld: the architecture is open. The crystallization recipes, eval gold, verifier internals, and shell logic stay private. You build on Qovaryx's substrate; we don't pretend you're getting the whole stack.
- Apache-2.0: research, hobby, commercial. Attribution appreciated, not legally required.
Qovaryx is NOT trying to be
- A frontier-IQ replacement
- A benchmark champion on broad evals
- A chat product
- A substitute for engineering on the wrapper / verifier / shell; those are where compact AI earns its keep
Sizes in this family: consumer-GPU first
| Repo | Params | d_model | n_layer | n_head | n_kv_head | d_ff | VRAM @ training (bf16, adamw_8bit) | VRAM @ inference (bf16) |
|---|---|---|---|---|---|---|---|---|
| tjarvis91/qovaryx-50m-scratch-base | ~47M | 512 | 12 | 8 | 2 | 1408 | <1 GB | <0.5 GB |
| tjarvis91/qovaryx-350m-scratch-base | ~352M | 1024 | 24 | 16 | 4 | 2816 | ~3 GB | ~1.5 GB |
| tjarvis91/qovaryx-1b-scratch-base | ~1.05B | 2048 | 22 | 16 | 4 | 5504 | ~12 GB | ~3 GB |
All three share the same component library and tokenizer, so pick the size your GPU can hold. You do not need an A100 to train these. A 16 GB consumer card handles every size in this family. A 12 GB card handles 50M and 350M comfortably. A 24 GB card lets you push 1B with larger batches.
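As a rough cross-check on the table, the weights alone in bf16 take two bytes per parameter. The sketch below uses the approximate parameter counts listed above; the function name is hypothetical. The VRAM columns are larger than these figures because they also cover activations, the KV cache and, for training, gradients and optimizer state.

```python
# Back-of-the-envelope weight memory in bf16 (2 bytes per parameter).
# Parameter counts are the approximate figures from the sizes table.

BYTES_PER_BF16 = 2

def bf16_weight_gb(n_params: float) -> float:
    """Size of the weights alone, in decimal gigabytes."""
    return n_params * BYTES_PER_BF16 / 1e9

sizes = {"50m": 47e6, "350m": 352e6, "1b": 1.05e9}
for name, n_params in sizes.items():
    print(f"{name}: {bf16_weight_gb(n_params):.2f} GB of weights")
```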
TL;DR: what's in this repo
| File | Purpose |
|---|---|
| `config.json` | Architecture spec (DecoderConfig): d_model, n_layer, FFN kind, MTP-K, GQA ratio, vocab, max_seq_len |
| `pytorch_model.bin` | Random-init weights (Glorot/Xavier per layer kind), bf16 |
| `tokenizer.json` | 20,242-vocab BPE (custom; domain-leaning but general-purpose) |
| `tokenizer_config.json` | Tokenizer wrapping config |
| `generation_config.json` | Default sampling params |
| `modeling_qovaryx.py` | FinanceDecoder class (named for legacy reasons; the class is task-agnostic) + heads + FFN backends |
| `train_quickstart.py` | A nanoGPT-style 200-line training loop you can run today |
| `README.md` | This card |
The model uses a custom architecture, so pass trust_remote_code=True when loading it. Apart from that flag, it loads like any other HF model.
Quickstart
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("tjarvis91/qovaryx-350m-scratch-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
"tjarvis91/qovaryx-350m-scratch-base",
trust_remote_code=True,
torch_dtype=torch.bfloat16,
).cuda()
# Out of the box this generates noise: the model is random-init by design.
# Train it on your own corpus first; only then will the output be useful.
out = model.generate(tok("hello", return_tensors="pt").input_ids.cuda(), max_new_tokens=20)
print(tok.decode(out[0]))
Minimal training loop (single GPU, bf16, AdamW):
import torch
from torch.utils.data import DataLoader

# `model` is the object loaded in the quickstart. `your_dataloader` is a placeholder
# for any iterable of dicts holding an "input_ids" tensor (and no "labels" key).
model.train()
opt = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.1, betas=(0.9, 0.95))
for step, batch in enumerate(your_dataloader):
    batch = {k: v.cuda() for k, v in batch.items()}
    with torch.amp.autocast("cuda", dtype=torch.bfloat16):
        out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    opt.step()
    opt.zero_grad()
    if step % 10 == 0:
        print(f"step={step} loss={out.loss.item():.4f}")
A full reference recipe (length curriculum + MTP-K + decision-head + packed shards + adamw_8bit for 16 GB cards) is in train_quickstart.py.
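A quick sanity check for a random-init run: if the untrained model predicts roughly uniformly over the vocabulary, the next-token cross-entropy starts near ln(vocab size). The sketch below only computes that reference value. The real step-0 loss can differ, depending on the init scale and on whether the reported loss also includes MTP or decision-head terms.

```python
import math

VOCAB_SIZE = 20_242  # tokenizer vocabulary size stated in this card

# Cross-entropy of a uniform distribution over the vocabulary, in nats.
uniform_loss = math.log(VOCAB_SIZE)
print(f"expected loss near step 0: ~{uniform_loss:.2f}")
```

A first logged loss far above this value usually points at a data or label problem rather than at the architecture.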
FFN backends: switchable via config
Set ffn_kind in config.json (or via from_pretrained(..., ffn_kind=...)):
| ffn_kind | Description | When to use |
|---|---|---|
| `swiglu` | Dense SwiGLU (the obvious baseline) | Default. Fastest wall-clock per step. |
| `ternary_swiglu` | BitNet-style ternary weights with straight-through estimator | When you care about deployable model size and accept ~3× slower training |
| `lowrank_swiglu` | Factorized projections (rank `ffn_rank`) | Param compression without sparsity |
| `routed_lowrank_swiglu` | Sparse MoE: `ffn_experts` top-`ffn_top_k` routing | When you want capacity without dense FLOPs |
These are inspired by published work (BitNet, DeepSeek-V3 MTP, Mixtral, GShard, ST-MoE). The novelty here is that all four share one trainer, one tokenizer, and one packed-shard pipeline, so switching backends is a config edit, not a fork.
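For readers new to ternary weights, the following is a minimal sketch of absmean ternary quantization in the style of BitNet b1.58, written in plain Python on a flat list. It is an assumption that `ternary_swiglu` uses this scaling rule; the backend in `modeling_qovaryx.py` works on tensors and may differ. The function name `ternarize` is hypothetical.

```python
# Sketch of absmean ternary quantization (BitNet b1.58 style).
# Assumption: the repo's ternary backend may use a different scaling rule.

def ternarize(weights: list[float], eps: float = 1e-8) -> tuple[list[int], float]:
    """Quantize weights to {-1, 0, +1} plus one shared scale."""
    scale = sum(abs(w) for w in weights) / len(weights)  # mean absolute value
    quantized = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return quantized, scale

w = [0.9, -0.05, 0.4, -1.2, 0.02]
q, scale = ternarize(w)
print(q, round(scale, 3))

# The forward pass uses q * scale. With a straight-through estimator, the
# backward pass treats the rounding step as identity, so gradients update
# the full-precision weights `w` directly.
dequantized = [qi * scale for qi in q]
print(dequantized)
```

The full-precision weights are kept during training and only the forward pass sees the ternary values, which is one reason training is slower than with the dense backend.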
Optional task heads
The base architecture exposes two opt-in heads, off by default:
- `decision_head_enabled`: 4-class classification head pooled at a chosen token position. Useful for downstream policy / preference / structured-action tasks. Co-trained via masked CE.
- `chart_patch_encoder_enabled`: strided-Conv2d raw-pixel encoder that converts an input image into prefix tokens, fed into the causal decoder before the text tokens. Useful for any text+image task; not specific to charts despite the name.
Both can be turned on per-row at training time (the trainer reads per-example metadata), so you can mix unimodal and multimodal rows in the same shard. Both are random-init in this repo and need to be trained alongside the LM head if you use them.
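"Masked CE" for the decision head can be read as: average the cross-entropy over rows that carry a decision label and skip the rest, which is what lets labeled and unlabeled rows share a shard. The sketch below illustrates that idea in plain Python. The `IGNORE` marker and the function name are hypothetical, and the trainer's actual masking convention may differ.

```python
import math

IGNORE = -1  # hypothetical marker for rows without a decision label

def masked_cross_entropy(logits: list[list[float]], labels: list[int]) -> float:
    """Mean cross-entropy over labeled rows only; unlabeled rows add nothing."""
    losses = []
    for row, label in zip(logits, labels):
        if label == IGNORE:
            continue
        # Numerically stable log-sum-exp, then the negative log-probability
        # of the target class.
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        losses.append(log_norm - row[label])
    return sum(losses) / len(losses) if losses else 0.0

# Three rows of 4-class logits; the middle row has no decision label.
logits = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
labels = [0, IGNORE, 3]
loss = masked_cross_entropy(logits, labels)
print(round(loss, 4))
```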
Suggested training recipes
These are starting points; tune to your data. Single 5070 Ti / RTX 4080-class GPU assumed.
50M baseline (LM only)
target_tokens: 500M-2B
tokens_per_batch: 4096
grad_accum_steps: 8
max_seq_len: 2048
length_curriculum: (512,1000)(1024,3000)(2048,10000)(4096,-1)
lr: 2e-4
warmup_steps: 500
weight_decay: 0.1
optimizer: adamw_8bit (bf16)
attn_backend: flash (FA2 if available, else PyTorch SDPA)
ffn_kind: swiglu
mtp_weight: 0.3
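To turn a token budget into a step count, the sketch below assumes that `tokens_per_batch` counts tokens per micro-batch, so one optimizer step consumes `tokens_per_batch * grad_accum_steps` tokens. That reading is an assumption; check `train_quickstart.py` for how the trainer defines these fields. The function name is hypothetical.

```python
import math

def optimizer_steps(target_tokens: float, tokens_per_batch: int, grad_accum_steps: int) -> int:
    """Optimizer steps needed to consume `target_tokens`.

    Assumes tokens_per_batch is per micro-batch, so one optimizer step sees
    tokens_per_batch * grad_accum_steps tokens.
    """
    tokens_per_step = tokens_per_batch * grad_accum_steps
    return math.ceil(target_tokens / tokens_per_step)

# 50M recipe: 4096 tokens per micro-batch, 8 accumulation steps,
# and a target of 500M to 2B tokens.
low = optimizer_steps(500e6, 4096, 8)
high = optimizer_steps(2e9, 4096, 8)
print(low, high)
```

Under that assumption, the 500-step warmup covers only a small fraction of even the shortest run.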
350M with MTP + decision head
target_tokens: 5B-20B
tokens_per_batch: 8192
grad_accum_steps: 16
max_seq_len: 4096
ffn_kind: ternary_swiglu (or swiglu)
mtp_weight: 0.3
decision_weight: 0.5
class_weighted_decision: true
calibration_loss_weight: 0.2 (if you want a confidence-calibrated head)
1B with sparse MoE
target_tokens: 50B-200B
ffn_kind: routed_lowrank_swiglu
ffn_rank: 128
ffn_experts: 4
ffn_top_k: 1
mixed_precision: bf16
optimizer: adamw_8bit
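The parameter saving from a low-rank projection is easy to estimate. The sketch below assumes "factorized" means replacing one dense `d_in x d_out` matrix with two matrices of rank `ffn_rank`; that reading and the function names are assumptions, not taken from the repo. It uses the 1B shape from the sizes table.

```python
def dense_params(d_in: int, d_out: int) -> int:
    """Parameters in one dense projection (bias-free)."""
    return d_in * d_out

def lowrank_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters when the projection is factorized as (d_in x rank) @ (rank x d_out)."""
    return rank * (d_in + d_out)

# 1B shape from the sizes table, with ffn_rank from the recipe.
d_model, d_ff, ffn_rank = 2048, 5504, 128
dense = dense_params(d_model, d_ff)
factored = lowrank_params(d_model, d_ff, ffn_rank)
print(dense, factored, round(dense / factored, 2))
```

With top-1 routing over 4 experts, each token passes through one expert's factorized projections, so per-token FFN compute stays close to that of a single low-rank FFN while total capacity grows with the number of experts.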
What this is NOT
- Not a pretrained model. Out-of-the-box outputs are noise. Random initialization is the entire point.
- Not finance-specific despite the legacy class name `FinanceDecoder`. The architecture is task-agnostic; the BPE tokenizer leans toward finance-aware merges but works on any English text.
- Not a drop-in replacement for Llama / Qwen / Mistral. The component set is different (MTP-K heads in particular need their own training term).
- Not adversarially robust. It's a substrate.
- Not a tiny / toy model. 1B params at bf16 hits 2 GB on disk; trained well, it competes seriously on focused tasks. "Compact" means efficient, not weak.
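The extra training term for MTP-K heads comes from their targets: head k is trained against the token k positions ahead instead of the next token. The sketch below builds such targets for K=4 in plain Python. The function name is hypothetical, and how the repo's trainer masks trailing positions or weights the extra heads is an assumption here.

```python
# Sketch of multi-token-prediction targets for K=4.
# Head k (1-based) at position t is trained to predict token t + k; positions
# whose target would fall past the end of the sequence are masked out.

IGNORE = -100  # default ignore index of PyTorch's cross-entropy loss

def mtp_targets(token_ids: list[int], k_heads: int = 4) -> list[list[int]]:
    """Return one target row per head, each the same length as token_ids."""
    n = len(token_ids)
    return [
        [token_ids[t + k] if t + k < n else IGNORE for t in range(n)]
        for k in range(1, k_heads + 1)
    ]

ids = [10, 11, 12, 13, 14, 15]
targets = mtp_targets(ids)
for k, row in enumerate(targets, start=1):
    print(f"head {k}: {row}")
```

Head 1 reproduces the usual next-token objective. The losses of the further heads would then be added with a weight such as the `mtp_weight: 0.3` used in the recipes.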
License
Apache-2.0. Use it for research, commercial work, or hobby projects. Attribution appreciated but not legally required.
Research notes
Qovaryx is part of a broader local-sovereign-AI research program. Higher-level framings, architectural rationale, and ablation studies are published progressively at:
Research index: https://github.com/thron-j/qovaryx-ai-research
Implementation details, training corpora, and certain ablation specifics are intentionally withheld in the public devlog. The framings are publishable; the internals are not. Collaboration inquiries: jeherizonllc@gmail.com.
Support
If this base helps you build something, support continued development:
Every contribution funds GPU time and the next-generation Qovaryx training runs.
Sibling models in this lineage
- tjarvis91/qovaryx-50m-scratch-base -- 47M params, 12 layers, fits on any GPU
- tjarvis91/qovaryx-350m-scratch-base <- you are here
- tjarvis91/qovaryx-1b-scratch-base -- 1.05B params, 22 layers, the full consumer-GPU target
- tjarvis91/vfaix-vpa-options-trader -- a separate, trained 9B vision-language model that uses the same training disciplines on Qwen3.5-VL (not the same architecture; shown here for lineage context)
Citation
@misc{qovaryx-scratch-base-2026,
title = {Qovaryx: A Compact Decoder Architecture with Multi-Token Prediction, GQA, and Pluggable FFN Backends},
author = {Jarvis, Thomas},
year = {2026},
month = {May},
publisher = {Hugging Face},
url = {https://huggingface.co/tjarvis91/qovaryx-350m-scratch-base}
}
Status
Random-init checkpoint as of 2026-05-22. Future updates will add trained sibling repos with downstream task heads enabled (decision head + chart-patch encoder variants). Watch the org page for new releases.