---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
inference: false
widget: []
tags:
- pretraining
- from-scratch
- random-init
- scratch-base
- decoder
- causal-lm
- causal-language-model
- transformer
- compact-llm
- compact-ai
- consumer-gpu
- local-ai
- on-device
- edge-ai
- 16gb-vram
- rtx-4090
- rtx-5090
- rtx-5070-ti
- small-language-model
- tiny-llm
- nanogpt
- pretraining-template
- mtp
- multi-token-prediction
- gqa
- grouped-query-attention
- swiglu
- ternary-swiglu
- bitnet
- moe
- mixture-of-experts
- sparse-moe
- low-rank
- flash-attention-2
- blackwell
- sm120
- pytorch
- own-base
- sovereign-ai
model-index:
- name: qovaryx-350m-scratch-base
  results: []
---

# Qovaryx 350M — Scratch Base (random-init)

> **Compact AI is not small AI. A 350M-parameter trainable substrate engineered to punch above its weight class on a single consumer GPU. Random-init — bring your own corpus, train it from scratch. MTP-K=4, GQA, pluggable FFN backends (dense SwiGLU / ternary BitNet-style / sparse low-rank MoE), optional task-specific heads. Apache-2.0.**

## Compact ≠ small

Frontier-scale models cost a small country's GPU budget to train and a data-center to serve. Most real applications don't need 70B params; they need **a focused 1B that does one thing extraordinarily well, fits in 16 GB of consumer VRAM, and stays on the hobbyist/researcher's local hardware** — no API key, no inference bill, no token-rate limit, no provider drift.

The Qovaryx family is built around that thesis. Same component library at 50M / 350M / 1B sizes, all engineered to:

- **Train on a single Ada- or Blackwell-class consumer GPU** (RTX 4080 / 4090 / 5070 Ti / 5080 / 5090). 50M fits in <1 GB; 350M fits at batch=1 grad-accum on 12 GB; 1B fits in 16 GB with bf16 + `adamw_8bit`.
- **Run inference on local hardware** — no provider lock-in. A serious workstation runs the 1B at usable throughput; the 350M runs on a laptop.
- **Pack modern components** into the smaller footprint: Multi-Token Prediction, GQA, ternary and sparse-MoE FFN backends, optional task heads. The architectural choices that make 70B models work also make 1B models punch above their weight class.

This repo is the **random-init starting point** for that research program. **No pretraining has occurred — the model emits noise out of the box.** It exists so you can train the architecture on your own tokens, your own task, your own budget, without paying the wall-clock cost of recreating the scaffold.

Think of it as a **trainable substrate** — like nanoGPT or the Pythia step-0 branches — but with a few modern components pre-wired:

- **Multi-Token Prediction (MTP-K=4) heads** for jointly predicting up to 4 tokens ahead
- **Grouped-Query Attention (GQA)** with configurable `n_head` / `n_kv_head` ratio (default 16:4)
- **Pluggable FFN backends:** dense SwiGLU, ternary SwiGLU (BitNet-style with straight-through estimator), low-rank SwiGLU, **routed low-rank MoE** (4 experts top-1)
- **Optional task heads:** 4-class decision head, raw-pixel chart-patch encoder (vision prefix tokens) — switchable via config
- **Custom 20,242-vocab BPE tokenizer** — domain-leaning but broadly reusable
- **Packed mmap shard format** for fast training on cheap consumer GPUs (one-time PackOnce compile, then mmap reads instead of per-row BPE)

The reference training setup is a single RTX 5070 Ti (16 GB, Blackwell sm_120) running PyTorch 2.7 + flash-attn 2.7.4 + bitsandbytes 0.49.2 (`adamw_8bit`). The 8-bit optimizer, bf16, and a length curriculum together mean a 50M-param sibling fits in <1 GB and a 1B sibling fits in 16 GB at batch=1.

---

## Why build on Qovaryx?

**Compact AI is not small AI.** Frontier-scale models ask *how do we build the biggest intelligence possible?* Qovaryx asks the inverse: *how much disciplined intelligence can we extract per parameter, per watt, per GPU?*

The published `*-scratch-base` checkpoints are the **trainable substrate** for that thesis.
They are not pre-trained — they are the random-init starting point, engineered so that one person on one consumer GPU can take the architecture all the way to a focused specialist model without renting a data-center.

| Dimension | Frontier closed (GPT-5, Claude, Gemini) | Frontier open (DeepSeek, Llama, Mistral, Qwen) | **Qovaryx** |
|---|---|---|---|
| **Primary philosophy** | Maximum general intelligence | Open-weight general foundation | Behavioral compression + corrective intelligence |
| **Infrastructure** | Multi-datacenter clusters | Multi-GPU enterprise / cloud | ✅ **Single consumer GPU** (RTX 4080 / 4090 / 5070 Ti / 5080 / 5090) |
| **Deployment** | Cloud / API only | Cloud or local (≥1× A100-class at the larger sizes) | ✅ **Local-first**, fits in 16 GB VRAM at every size |
| **Cost model** | Very high compute + ongoing API spend | Moderate-high compute, lower at inference | ✅ **Consumer-grade** — power bill + GPU you already own |
| **License** | Closed weights, ToS-gated | Open weights (license varies) | ✅ **Apache-2.0** weights + Apache-2.0 reference trainer |
| **Behavioral control** | Mostly emergent / safety-layer | Fine-tune dependent | ✅ **Deterministic shell + crystal governance** — explicit, not emergent |
| **Specialization strategy** | One giant universal model | General foundation, fine-tune downstream | ✅ **Modular specialists** composed via the same compact base |
| **Confidence handling** | Opaque token probabilities | Token probabilities | ✅ **Calibrated 4-class decision head** (action-gate-style classifier, optional) |
| **Multi-token prediction** | Generally next-token only | Generally next-token only (DeepSeek-V3's MTP objective is an exception) | ✅ **MTP-K=4** built in (4-tokens-ahead joint head) |
| **FFN options** | Dense | Dense or MoE (frontier sizes) | ✅ **Pluggable**: dense SwiGLU / ternary BitNet-style / sparse low-rank MoE — config flag |
| **Attention** | MHA / GQA | GQA | ✅ **GQA** with configurable n_head:n_kv_head ratio |
| **Training tokenizer** | Provider-controlled | Provider-controlled | ✅ **You bundle it** (20,242-vocab BPE shipped; replaceable) |
| **Vision input** | Provider plugin | Provider plugin | ✅ **Optional raw-pixel chart-patch encoder** — switchable per-row at train time |

✅ = something Qovaryx provides out of the box on the scratch-base release.

This is not a claim that Qovaryx beats GPT-5 on MMLU. It will not. **It is a claim that the right shape of small can do real work where the right shape of huge is unavailable, unaffordable, or unowned.**

### Why this base helps you build

- **The components are already wired** — MTP-K, GQA, decision head, ternary/MoE FFN backends, chart patch encoder. Switchable via config. Skip three months of architecture work.
- **It fits** — 50M fits anywhere; 350M fits on a 12 GB card; 1B fits on a 16 GB consumer card with `adamw_8bit` + bf16. You can actually train these on hardware you can actually buy.
- **It's honest about what's withheld** — the architecture is open. The crystallization recipes, eval gold, verifier internals, and shell logic stay private. You build *on* Qovaryx's substrate; we don't pretend you're getting the whole stack.
- **Apache-2.0** — research, hobby, commercial. Attribution appreciated, not legally required.
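
The "It fits" bullet above comes down to bytes per parameter. As a rough sanity check, the sketch below estimates the static training footprint (weights, gradients, optimizer state) under assumptions that are ours, not taken from the Qovaryx trainer: bf16 weights and gradients at 2 bytes each, and an 8-bit Adam-style optimizer holding two states at 1 byte each. It ignores activations, quantization block statistics, and framework overhead, so real usage will be higher.

```python
# Rough static-memory estimate for bf16 training with an 8-bit Adam-style optimizer.
# All byte counts below are illustrative assumptions, not measured values.
BYTES_WEIGHT = 2      # bf16 weights
BYTES_GRAD = 2        # bf16 gradients
BYTES_OPT_STATE = 2   # two optimizer moments stored at 1 byte each


def static_training_bytes(n_params: int) -> int:
    """Bytes for weights + grads + optimizer state, excluding activations."""
    return n_params * (BYTES_WEIGHT + BYTES_GRAD + BYTES_OPT_STATE)


# Nominal family sizes; the exact parameter counts differ slightly.
for name, n_params in [("50M", 50_000_000), ("350M", 350_000_000), ("1B", 1_000_000_000)]:
    gb = static_training_bytes(n_params) / 1e9
    print(f"{name}: ~{gb:.1f} GB before activations")
```

Under these assumptions the 350M model needs about 2.1 GB before activations and the 1B about 6 GB, which leaves headroom on a 16 GB card for activations at small batch sizes.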
### Qovaryx is NOT trying to be

- A frontier-IQ replacement
- A benchmark champion on broad evals
- A chat product
- A substitute for engineering on the wrapper / verifier / shell — those are where compact AI earns its keep

---

## Sizes in this family — consumer-GPU first

| Repo | Params | d_model | n_layer | n_head | n_kv_head | d_ff | VRAM @ training (bf16, adamw_8bit) | VRAM @ inference (bf16) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `tjarvis91/qovaryx-50m-scratch-base` | ~47M | 512 | 12 | 8 | 2 | 1408 | <1 GB | <0.5 GB |
| **`tjarvis91/qovaryx-350m-scratch-base`** | **~352M** | **1024** | **24** | **16** | **4** | **2816** | **~3 GB** | **~1.5 GB** |
| `tjarvis91/qovaryx-1b-scratch-base` | ~1.05B | 2048 | 22 | 16 | 4 | 5504 | ~12 GB | ~3 GB |

All three share the same component library and tokenizer — pick the size your GPU can hold.

**You do not need an A100 to train these.** A 16 GB consumer card handles every size in this family. A 12 GB card handles the 50M and 350M comfortably. A 24 GB card lets you push the 1B with larger batches.

---

## TL;DR — what's in this repo

| File | Purpose |
|---|---|
| `config.json` | Architecture spec (`DecoderConfig`) — d_model, n_layer, FFN kind, MTP-K, GQA ratio, vocab, max_seq_len |
| `pytorch_model.bin` | Random-init weights (Glorot/Xavier per layer kind), bf16 |
| `tokenizer.json` | 20,242-vocab BPE (custom; domain-leaning but general-purpose) |
| `tokenizer_config.json` | Tokenizer wrapping config |
| `generation_config.json` | Default sampling params |
| `modeling_qovaryx.py` | `FinanceDecoder` class (named for legacy reasons; the class is task-agnostic) + heads + FFN backends |
| `train_quickstart.py` | A nanoGPT-style 200-line training loop you can run today |
| `README.md` | This card |

Because the architecture is custom, loading requires `trust_remote_code=True`. Otherwise it loads like any other HF model.
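
To relate the columns of the sizes table to the parameter counts, here is a back-of-the-envelope sketch of the decoder backbone only. It assumes a conventional layout that this card does not confirm: bias-free projections, `head_dim = d_model / n_head`, a three-matrix SwiGLU FFN, two norm weight vectors per layer plus one final norm, and (optionally) tied input/output embeddings. The function name and its arguments are hypothetical. It leaves out the MTP heads and the optional task heads, so it should come in below the table's totals.

```python
def backbone_params(vocab, d_model, n_layer, n_head, n_kv_head, d_ff, tied_embeddings=True):
    """Estimate backbone parameters for a GQA + dense-SwiGLU decoder (illustrative layout)."""
    head_dim = d_model // n_head
    kv_dim = n_kv_head * head_dim              # GQA shrinks the K and V projections
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim   # Q and O full width, K and V reduced
    ffn = 3 * d_model * d_ff                   # gate, up, and down projections
    norms = 2 * d_model                        # one norm weight vector before attn and before FFN
    embed = vocab * d_model
    total = n_layer * (attn + ffn + norms) + d_model + embed  # + final norm + token embedding
    if not tied_embeddings:
        total += embed                         # separate output projection
    return total


# Rows of the sizes table, with the 20,242-token vocabulary.
for name, cfg in {
    "50M": (20242, 512, 12, 8, 2, 1408),
    "350M": (20242, 1024, 24, 16, 4, 2816),
    "1B": (20242, 2048, 22, 16, 4, 5504),
}.items():
    print(f"{name}: ~{backbone_params(*cfg) / 1e6:.0f}M backbone parameters (tied embeddings)")
```

The useful takeaway is the split rather than the exact total: at `d_model=1024` the FFN accounts for roughly three quarters of each layer, and moving from 16 to 4 KV heads cuts the K and V projections to a quarter of their full-width size.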
---

## Quickstart

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("tjarvis91/qovaryx-350m-scratch-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "tjarvis91/qovaryx-350m-scratch-base",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda()

# Out-of-the-box this generates noise — model is random-init by design.
# Train it on your own corpus, then it will be useful.
out = model.generate(tok("hello", return_tensors="pt").input_ids.cuda(), max_new_tokens=20)
print(tok.decode(out[0]))
```

Minimal training loop (single GPU, bf16, AdamW):

```python
import torch

# `model` is the bf16 model loaded above. `your_dataloader` is a placeholder:
# any iterable of dicts of tensors with an "input_ids" key and no "labels" key
# (labels are passed explicitly below).
model.train()  # from_pretrained returns the model in eval mode
opt = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.1, betas=(0.9, 0.95))

for step, batch in enumerate(your_dataloader):
    batch = {k: v.cuda() for k, v in batch.items()}
    with torch.amp.autocast("cuda", dtype=torch.bfloat16):
        # Causal-LM objective: the inputs double as the labels.
        out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    opt.step()
    opt.zero_grad(set_to_none=True)
    if step % 10 == 0:
        print(f"step={step} loss={out.loss.item():.4f}")
```

A full reference recipe (length curriculum + MTP-K + decision-head + packed shards + `adamw_8bit` for 16 GB cards) is in `train_quickstart.py`.

---

## FFN backends — switchable via config

Set `ffn_kind` in `config.json` (or via `from_pretrained(..., ffn_kind=...)`):

| `ffn_kind` | Description | When to use |
|---|---|---|
| `swiglu` | Dense SwiGLU (the obvious baseline) | Default. Fastest wall-clock per step. |
| `ternary_swiglu` | BitNet-style ternary weights with straight-through estimator | When you care about deployable model size and accept ~3× slower training |
| `lowrank_swiglu` | Factorized projections (rank `ffn_rank`) | Param compression without sparsity |
| `routed_lowrank_swiglu` | Sparse MoE: `ffn_experts` top-`ffn_top_k` routing | When you want capacity without dense FLOPs |

These components are inspired by published work (BitNet, DeepSeek-V3 MTP, Mixtral, GShard, ST-MoE). The novelty here is that all four backends share **one trainer**, **one tokenizer**, and **one packed-shard pipeline** — so switching backends is a config edit, not a fork.

---

## Optional task heads

The base architecture exposes two opt-in heads, off by default:

- **`decision_head_enabled`** — 4-class classification head pooled at a chosen token position. Useful for downstream policy / preference / structured-action tasks. Co-trained via masked CE.
- **`chart_patch_encoder_enabled`** — strided-Conv2d raw-pixel encoder that converts an input image into prefix tokens, fed into the causal decoder before the text tokens. Useful for any text+image task; not specific to charts despite the name.

Both can be turned on per-row at training time (the trainer reads per-example metadata), so you can mix unimodal and multimodal rows in the same shard. Both are random-init in this repo and need to be trained alongside the LM head if you use them.

---

## Suggested training recipes

These are **starting points** — tune to your data. Single 5070 Ti / RTX 4080-class GPU assumed.
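
Before picking a recipe, it can help to translate a token budget into optimizer steps. The sketch below does that arithmetic for the `target_tokens`, `tokens_per_batch`, and `grad_accum_steps` fields used in the recipes in this section. It assumes that `tokens_per_batch` counts tokens per micro-batch and that one optimizer step consumes `tokens_per_batch * grad_accum_steps` tokens; that reading is our interpretation of the field names, so check it against `train_quickstart.py`. The function name is hypothetical.

```python
def optimizer_steps(target_tokens: int, tokens_per_batch: int, grad_accum_steps: int) -> int:
    """Optimizer steps needed to consume `target_tokens` (assumed field semantics)."""
    tokens_per_step = tokens_per_batch * grad_accum_steps
    # Integer ceiling division: a partial final step still counts as a step.
    return -(-target_tokens // tokens_per_step)


# Lower bounds of the 50M and 350M token budgets.
print(optimizer_steps(500_000_000, tokens_per_batch=4096, grad_accum_steps=8))
print(optimizer_steps(5_000_000_000, tokens_per_batch=8192, grad_accum_steps=16))
```

Comparing the result with `warmup_steps` and any step-based schedule is a quick way to check that the schedule is sized for the run you actually plan to do.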
### 50M baseline (LM only)

```
target_tokens: 500M-2B
tokens_per_batch: 4096
grad_accum_steps: 8
max_seq_len: 2048
length_curriculum: (512,1000)(1024,3000)(2048,10000)(4096,-1)
lr: 2e-4
warmup_steps: 500
weight_decay: 0.1
optimizer: adamw_8bit (bf16)
attn_backend: flash (FA2 if available, else PyTorch SDPA)
ffn_kind: swiglu
mtp_weight: 0.3
```

### 350M with MTP + decision head

```
target_tokens: 5B-20B
tokens_per_batch: 8192
grad_accum_steps: 16
max_seq_len: 4096
ffn_kind: ternary_swiglu (or swiglu)
mtp_weight: 0.3
decision_weight: 0.5
class_weighted_decision: true
calibration_loss_weight: 0.2 (if you want a confidence-calibrated head)
```

### 1B with sparse MoE

```
target_tokens: 50B-200B
ffn_kind: routed_lowrank_swiglu
ffn_rank: 128
ffn_experts: 4
ffn_top_k: 1
mixed_precision: bf16
optimizer: adamw_8bit
```

---

## What this is NOT

- **Not a pretrained model.** Out-of-the-box outputs are noise. Random initialization is the entire point.
- **Not finance-specific** despite the legacy class name `FinanceDecoder`. The architecture is task-agnostic; the BPE tokenizer leans toward finance-aware merges but works on any English text.
- **Not a drop-in replacement** for Llama / Qwen / Mistral. The component set is different (MTP-K heads in particular need their own training term).
- **Not adversarially robust.** It's a substrate.
- **Not a tiny / toy model.** 1B params at bf16 hits 2 GB on disk; trained well, it competes seriously on focused tasks. "Compact" means efficient, not weak.

---

## License

Apache-2.0. Use it for research, commercial work, hobby projects — whatever. Attribution appreciated but not legally required.

---

## Research notes

Qovaryx is part of a broader local-sovereign-AI research program.
Higher-level framings, architectural rationale, and ablation studies are published progressively at:

**Research index:** https://github.com/thron-j/qovaryx-ai-research

Implementation details, training corpora, and certain ablation specifics are intentionally withheld from the public devlog. The framings are publishable; the internals are not. Collaboration inquiries: jeherizonllc@gmail.com.

---

## Support

If this base helps you build something, support continued development:

**☕ [ko-fi.com/tjarvis91](https://ko-fi.com/tjarvis91)**

Every contribution funds GPU time and the next-generation Qovaryx training runs.

---

## Sibling models in this lineage

- [`tjarvis91/qovaryx-50m-scratch-base`](https://huggingface.co/tjarvis91/qovaryx-50m-scratch-base) -- 47M params, 12 layers, fits on any GPU
- **`tjarvis91/qovaryx-350m-scratch-base`** <- you are here
- [`tjarvis91/qovaryx-1b-scratch-base`](https://huggingface.co/tjarvis91/qovaryx-1b-scratch-base) -- 1.05B params, 22 layers, the full consumer-GPU target
- [`tjarvis91/vfaix-vpa-options-trader`](https://huggingface.co/tjarvis91/vfaix-vpa-options-trader) -- a separate, trained 9B vision-language model that uses the same training disciplines on Qwen3.5-VL (not the same architecture; shown here for lineage context)

---

## Citation

```bibtex
@misc{qovaryx-scratch-base-2026,
  title     = {Qovaryx: A Compact Decoder Architecture with Multi-Token Prediction, GQA, and Pluggable FFN Backends},
  author    = {Jarvis, Thomas},
  year      = {2026},
  month     = {May},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/tjarvis91/qovaryx-350m-scratch-base}
}
```

---

## Status

Random-init checkpoint as of 2026-05-22. Future updates will add **trained sibling repos** with downstream task heads enabled (decision head + chart-patch encoder variants). Watch the org page for new releases.