---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
inference: false
widget: []
tags:
- pretraining
- from-scratch
- random-init
- scratch-base
- decoder
- causal-lm
- causal-language-model
- transformer
- compact-llm
- compact-ai
- consumer-gpu
- local-ai
- on-device
- edge-ai
- 16gb-vram
- rtx-4090
- rtx-5090
- rtx-5070-ti
- small-language-model
- tiny-llm
- nanogpt
- pretraining-template
- mtp
- multi-token-prediction
- gqa
- grouped-query-attention
- swiglu
- ternary-swiglu
- bitnet
- moe
- mixture-of-experts
- sparse-moe
- low-rank
- flash-attention-2
- blackwell
- sm120
- pytorch
- own-base
- sovereign-ai
model-index:
- name: qovaryx-350m-scratch-base
  results: []
---

# Qovaryx 350M — Scratch Base (random-init)

> **Compact AI is not small AI. A 350M-parameter trainable substrate engineered to punch above its weight class on a single consumer GPU. Random-init — bring your own corpus, train it from scratch. MTP-K=4, GQA, pluggable FFN backends (dense SwiGLU / ternary BitNet-style / sparse low-rank MoE), optional task-specific heads. Apache-2.0.**

## Compact ≠ small

Frontier-scale models cost a small country's GPU budget to train and a data-center to serve. Most real applications don't need 70B params; they need **a focused 1B that does one thing extraordinarily well, fits in 16 GB of consumer VRAM, and stays on the hobbyist/researcher's local hardware** — no API key, no inference bill, no token-rate limit, no provider drift.

The Qovaryx family is built around that thesis. Same component library at 50M / 350M / 1B sizes, all engineered to:

- **Train on a single Blackwell-class consumer GPU** (RTX 4080 / 4090 / 5070 Ti / 5080 / 5090). 50M fits in <1 GB; 350M fits at batch=1 grad-accum on 12 GB; 1B fits in 16 GB with bf16 + `adamw_8bit`.
- **Inference on local hardware** — no provider lock-in. A serious workstation runs the 1B at usable throughput; the 350M runs on a laptop.
- **Pack modern components** into the smaller footprint: Multi-Token Prediction, GQA, ternary and sparse-MoE FFN backends, optional task heads. The architectural choices that make 70B models work also make 1B models punch above their weight class.

This repo is the **random-init starting point** for that research program. **No pretraining has occurred — the model emits noise out of the box.** It exists so you can train the architecture on your own tokens, your own task, your own budget, without paying the wall-clock cost of recreating the scaffold.

Think of it as a **trainable substrate** — like nanoGPT or the Pythia step-0 branches — but with a few modern components pre-wired:

- **Multi-Token Prediction (MTP-K=4) heads** for jointly predicting up to 4 tokens ahead
- **Grouped-Query Attention (GQA)** with configurable `n_head` / `n_kv_head` ratio (default 16:4)
- **Pluggable FFN backends:** dense SwiGLU, ternary SwiGLU (BitNet-style with straight-through estimator), low-rank SwiGLU, **routed low-rank MoE** (4 experts top-1)
- **Optional task heads:** 4-class decision head, raw-pixel chart-patch encoder (vision prefix tokens) — switchable via config
- **Custom 20,242-vocab BPE tokenizer** — domain-leaning but broadly reusable
- **Packed mmap shard format** for fast training on cheap consumer GPUs (one-time PackOnce compile, then mmap reads instead of per-row BPE)

Trained on a single RTX 5070 Ti (16 GB, Blackwell sm_120) using PyTorch 2.7 + flash-attn 2.7.4 + bnb 0.49.2 (`adamw_8bit`). 8-bit optimizer + bf16 + length curriculum means a 50M-param sibling fits in <1 GB and a 1B sibling fits in 16 GB at batch=1.

---

## Why build on Qovaryx?

**Compact AI is not small AI.** Frontier-scale models ask *how do we build the biggest intelligence possible?* Qovaryx asks the inverse: *how much disciplined intelligence can we extract per parameter, per watt, per GPU?*

The published `*-scratch-base` checkpoints are the **trainable substrate** for that thesis. They are not pre-trained — they are the random-init starting point, engineered so that one person on one consumer GPU can take the architecture all the way to a focused specialist model without renting a data-center.

| Dimension | Frontier closed (GPT-5, Claude, Gemini) | Frontier open (DeepSeek, Llama, Mistral, Qwen) | **Qovaryx** |
|---|---|---|---|
| **Primary philosophy** | Maximum general intelligence | Open-weight general foundation | Behavioral compression + corrective intelligence |
| **Infrastructure** | Multi-datacenter clusters | Multi-GPU enterprise / cloud | ✅ **Single consumer GPU** (RTX 4080 / 4090 / 5070 Ti / 5080 / 5090) |
| **Deployment** | Cloud / API only | Cloud or local (≥1× A100-class at the larger sizes) | ✅ **Local-first**, fits in 16 GB VRAM at every size |
| **Cost model** | Very high compute + ongoing API spend | Moderate-high compute, lower at inference | ✅ **Consumer-grade** — power bill + GPU you already own |
| **License** | Closed weights, ToS-gated | Open weights (license varies) | ✅ **Apache-2.0** weights + Apache-2.0 reference trainer |
| **Behavioral control** | Mostly emergent / safety-layer | Fine-tune dependent | ✅ **Deterministic shell + crystal governance** — explicit, not emergent |
| **Specialization strategy** | One giant universal model | General foundation, fine-tune downstream | ✅ **Modular specialists** composed via the same compact base |
| **Confidence handling** | Opaque token probabilities | Token probabilities | ✅ **Calibrated 4-class decision head** (action-gate-style classifier, optional) |
| **Multi-token prediction** | Generally next-token only | Generally next-token only | ✅ **MTP-K=4** built in (4-tokens-ahead joint head) |
| **FFN options** | Dense | Dense or MoE (frontier sizes) | ✅ **Pluggable**: dense SwiGLU / ternary BitNet-style / sparse low-rank MoE — config flag |
| **Attention** | MHA / GQA | GQA | ✅ **GQA** with configurable n_head:n_kv_head ratio |
| **Training tokenizer** | Provider-controlled | Provider-controlled | ✅ **You bundle it** (20,242-vocab BPE shipped; replaceable) |
| **Vision input** | Provider plugin | Provider plugin | ✅ **Optional raw-pixel chart-patch encoder** — switchable per-row at train time |

✅ = something Qovaryx provides out of the box on the scratch-base release.

This is not a claim that Qovaryx beats GPT-5 on MMLU. It will not. **It is a claim that the right shape of small can do real work where the right shape of huge is unavailable, unaffordable, or unowned.**

### Why this base helps you build

- **The components are already wired** — MTP-K, GQA, decision head, ternary/MoE FFN backends, chart patch encoder. Switchable via config. Skip three months of architecture work.
- **It fits** — 50M fits anywhere; 350M fits on a 12 GB card; 1B fits on a 16 GB consumer card with `adamw_8bit` + bf16. You can actually train these on hardware you can actually buy.
- **It's honest about what's withheld** — the architecture is open. The crystallization recipes, eval gold, verifier internals, and shell logic stay private. You build *on* Qovaryx's substrate; we don't pretend you're getting the whole stack.
- **Apache-2.0** — research, hobby, commercial. Attribution appreciated, not legally required.

### Qovaryx is NOT trying to be

- A frontier-IQ replacement
- A benchmark champion on broad evals
- A chat product
- A substitute for engineering on the wrapper / verifier / shell — those are where compact AI earns its keep

---

## Sizes in this family — consumer-GPU first

| Repo | Params | d_model | n_layer | n_head | n_kv_head | d_ff | VRAM @ training (bf16, adamw_8bit) | VRAM @ inference (bf16) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `tjarvis91/qovaryx-50m-scratch-base` | ~47M | 512 | 12 | 8 | 2 | 1408 | <1 GB | <0.5 GB |
| **`tjarvis91/qovaryx-350m-scratch-base`** | **~352M** | **1024** | **24** | **16** | **4** | **2816** | **~3 GB** | **~1.5 GB** |
| `tjarvis91/qovaryx-1b-scratch-base` | ~1.05B | 2048 | 22 | 16 | 4 | 5504 | ~12 GB | ~3 GB |

All three share the same component library and tokenizer — pick the size your GPU can hold. **You do not need an A100 to train these.** A 16 GB consumer card handles every size in this family. A 12 GB card handles 50m + 350m comfortably. A 24 GB card lets you push 1B with larger batches.

---

## TL;DR — what's in this repo

| File | Purpose |
|---|---|
| `config.json` | Architecture spec (`DecoderConfig`) — d_model, n_layer, FFN kind, MTP-K, GQA ratio, vocab, max_seq_len |
| `pytorch_model.bin` | Random-init weights (Glorot/Xavier per layer kind), bf16 |
| `tokenizer.json` | 20,242-vocab BPE (custom; domain-leaning but general-purpose) |
| `tokenizer_config.json` | Tokenizer wrapping config |
| `generation_config.json` | Default sampling params |
| `modeling_qovaryx.py` | `FinanceDecoder` class (named for legacy reasons; the class is task-agnostic) + heads + FFN backends |
| `train_quickstart.py` | A nanoGPT-style 200-line training loop you can run today |
| `README.md` | This card |

The model uses `trust_remote_code=True` (custom architecture). Load it like any other HF model.

---

## Quickstart

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("tjarvis91/qovaryx-350m-scratch-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "tjarvis91/qovaryx-350m-scratch-base",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda()

# Out-of-the-box this generates noise — model is random-init by design.
# Train it on your own corpus, then it will be useful.
out = model.generate(tok("hello", return_tensors="pt").input_ids.cuda(), max_new_tokens=20)
print(tok.decode(out[0]))
```

Minimal training loop (single GPU, bf16, AdamW):

```python
import torch
from torch.utils.data import DataLoader

opt = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.1, betas=(0.9, 0.95))
for step, batch in enumerate(your_dataloader):
    batch = {k: v.cuda() for k, v in batch.items()}
    with torch.amp.autocast("cuda", dtype=torch.bfloat16):
        out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    opt.step(); opt.zero_grad()
    if step % 10 == 0:
        print(f"step={step} loss={out.loss.item():.4f}")
```

A full reference recipe (length curriculum + MTP-K + decision-head + packed shards + `adamw_8bit` for 16 GB cards) is in `train_quickstart.py`.

---

## FFN backends — switchable via config

Set `ffn_kind` in `config.json` (or via `from_pretrained(..., ffn_kind=...)`):

| `ffn_kind` | Description | When to use |
|---|---|---|
| `swiglu` | Dense SwiGLU (the obvious baseline) | Default. Fastest wall-clock per step. |
| `ternary_swiglu` | BitNet-style ternary weights with straight-through estimator | When you care about deployable model size and accept ~3× slower training |
| `lowrank_swiglu` | Factorized projections (rank `ffn_rank`) | Param compression without sparsity |
| `routed_lowrank_swiglu` | Sparse MoE: `ffn_experts` top-`ffn_top_k` routing | When you want capacity without dense FLOPs |

These are inspired by published work (BitNet, DeepSeek-V3 MTP, Mixtral, GShard, ST-MoE). The novelty here is that all four share **one trainer**, **one tokenizer**, and **one packed-shard pipeline** — so switching backends is a config edit, not a fork.

---

## Optional task heads

The base architecture exposes two opt-in heads, off by default:

- **`decision_head_enabled`** — 4-class classification head pooled at a chosen token position. Useful for downstream policy / preference / structured-action tasks. Co-trained via masked CE.
- **`chart_patch_encoder_enabled`** — strided-Conv2d raw-pixel encoder that converts an input image into prefix tokens, fed into the causal decoder before the text tokens. Useful for any text+image task; not specific to charts despite the name.

Both can be turned on per-row at training time (the trainer reads per-example metadata), so you can mix unimodal and multimodal rows in the same shard. Both are random-init in this repo and need to be trained alongside the LM head if you use them.

---

## Suggested training recipes

These are **starting points** — tune to your data. Single 5070 Ti / RTX 4080-class GPU assumed.

### 50M baseline (LM only)

```
target_tokens:           500M-2B
tokens_per_batch:        4096
grad_accum_steps:        8
max_seq_len:             2048
length_curriculum:       (512,1000)(1024,3000)(2048,10000)(4096,-1)
lr:                      2e-4
warmup_steps:            500
weight_decay:            0.1
optimizer:               adamw_8bit (bf16)
attn_backend:            flash (FA2 if available, else PyTorch SDPA)
ffn_kind:                swiglu
mtp_weight:              0.3
```

### 350M with MTP + decision head

```
target_tokens:           5B-20B
tokens_per_batch:        8192
grad_accum_steps:        16
max_seq_len:             4096
ffn_kind:                ternary_swiglu  (or swiglu)
mtp_weight:              0.3
decision_weight:         0.5
class_weighted_decision: true
calibration_loss_weight: 0.2  (if you want a confidence-calibrated head)
```

### 1B with sparse MoE

```
target_tokens:           50B-200B
ffn_kind:                routed_lowrank_swiglu
ffn_rank:                128
ffn_experts:             4
ffn_top_k:               1
mixed_precision:         bf16
optimizer:               adamw_8bit
```

---

## What this is NOT

- **Not a pretrained model.** Out-of-the-box outputs are noise. Random initialization is the entire point.
- **Not finance-specific** despite the legacy class name `FinanceDecoder`. The architecture is task-agnostic; the BPE tokenizer leans toward finance-aware merges but works on any English text.
- **Not a drop-in replacement** for Llama / Qwen / Mistral. The component set is different (MTP-K heads in particular need their own training term).
- **Not adversarially robust.** It's a substrate.
- **Not a tiny / toy model.** 1B params at bf16 hits 2 GB on disk; trained well, it competes seriously on focused tasks. "Compact" means efficient, not weak.

---

## License

Apache-2.0. Use it for research, commercial work, hobby projects — whatever. Attribution appreciated but not legally required.

---

## Research notes

Qovaryx is part of a broader local-sovereign-AI research program. Higher-level framings, architectural rationale, and ablation studies are published progressively at:

**Research index:** https://github.com/thron-j/qovaryx-ai-research

Implementation details, training corpora, and certain ablation specifics are intentionally withheld in the public devlog. The framings are publishable; the internals are not. Collaboration inquiries: jeherizonllc@gmail.com.

---

## Support

If this base helps you build something, support continued development:

**☕ [ko-fi.com/tjarvis91](https://ko-fi.com/tjarvis91)**

Every contribution funds GPU time and the next-generation Qovaryx training runs.

---

## Sibling models in this lineage

- [`tjarvis91/qovaryx-50m-scratch-base`](https://huggingface.co/tjarvis91/qovaryx-50m-scratch-base) -- 47M params, 12 layers, fits on any GPU
- **`tjarvis91/qovaryx-350m-scratch-base`** <- you are here
- [`tjarvis91/qovaryx-1b-scratch-base`](https://huggingface.co/tjarvis91/qovaryx-1b-scratch-base) -- 1.05B params, 22 layers, the full consumer-GPU target
- [`tjarvis91/vfaix-vpa-options-trader`](https://huggingface.co/tjarvis91/vfaix-vpa-options-trader) -- a separate, trained 9B vision-language model that uses the same training disciplines on Qwen3.5-VL (not the same architecture; shown here for lineage context)

---

## Citation

```bibtex
@misc{qovaryx-scratch-base-2026,
  title     = {Qovaryx: A Compact Decoder Architecture with Multi-Token Prediction, GQA, and Pluggable FFN Backends},
  author    = {Jarvis, Thomas},
  year      = {2026},
  month     = {May},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/tjarvis91/qovaryx-350m-scratch-base}
}
```

---

## Status

Random-init checkpoint as of 2026-05-22. Future updates will add **trained sibling repos** with downstream task heads enabled (decision head + chart-patch encoder variants). Watch the org page for new releases.