---
license: apache-2.0
language:
- en
base_model: tjarvis91/qovaryx-50m-scratch-base
base_model_relation: finetune
library_name: pytorch
pipeline_tag: text-generation
tags:
- text-generation
- rag
- retrieval-augmented-generation
- reranker
- relevance-scoring
- sovereign-base
- on-device
- edge-ai
- qovaryx
- compact-cognition
- local-ai
- cross-domain
- factual-grounding
- anti-hallucination
- 50m
- bpe
- english
- small-language-model
- slm
- beir
---

# Q-RAG-50M-Sovereign: the sovereign retrieval head that punches above its weight

> **A 50M-parameter relevance scorer that beats BGE-reranker-large (560M, 11× larger) on in-distribution refusal and ties or beats 2 of the 10 baselines on out-of-distribution BEIR, on CPU, fully sovereign.**

## What this model does, in one sentence

Given a USER query and a CANDIDATE passage, Q-RAG outputs **exactly one character** (`1` if the passage is relevant to the query, `0` if it is not), making it a drop-in relevance filter for any RAG (retrieval-augmented generation) pipeline.

## Headline: where Q-RAG wins, where it loses, why both matter

### In-distribution (10-domain Q-RAG holdout, 30 rows): **#1 of 11**

Q-RAG was trained on **cross-domain refusal as a first-class objective**: every query was paired with both same-domain near-miss adversaries and cross-domain off-topic passages. On the holdout that tests exactly this, **Q-RAG beats every model we evaluated**, including BGE-reranker-large (560M) and BGE-reranker-v2-m3 (568M), both roughly 11× our parameter count.


| Rank | Model | Params | Acc | Carry-12 | Cross-18 |
|---:|---|---:|---:|---:|---:|
| **1** | **Q-RAG-50M-Sovereign** | **50M** | **100.0%** | **100.0%** | **100.0%** |
| 2 | bge-reranker-large | 560M | 96.7% | 100.0% | 94.4% |
| 2 | bge-reranker-v2-m3 | 568M | 96.7% | 100.0% | 94.4% |
| 4 | ms-marco-MiniLM-L-6-v2 | 23M | 93.3% | 100.0% | 88.9% |
| 4 | ms-marco-MiniLM-L-12-v2 | 33M | 93.3% | 100.0% | 88.9% |
| 4 | mxbai-rerank-xsmall-v1 | 70M | 93.3% | 100.0% | 88.9% |
| 4 | gte-reranker-modernbert-base | 149M | 93.3% | 100.0% | 88.9% |
| 8 | e5-small-v2 | 33M | 90.0% | 100.0% | 83.3% |
| 8 | bge-reranker-base | 278M | 90.0% | 100.0% | 83.3% |
| 10 | bge-small-en-v1.5 | 33M | 86.7% | 100.0% | 77.8% |
| 10 | bge-m3 | 568M | 86.7% | 91.7% | 83.3% |

Cross-18 is the 18-row cross-domain refusal subset; Carry-12 is the remaining 12 rows of the 30-row holdout. All baselines are at their *oracle* threshold (the threshold chosen to maximize their accuracy on the full holdout, which is a generous upper bound). Q-RAG outputs `1` or `0` directly with no threshold to tune.

### Out-of-distribution (BEIR NFCorpus + SciFact slice, 250 rows): **rank 9 of 11**, but the gap is tiny

We also tested on **BEIR**, a public IR benchmark. The slice combines NFCorpus (medical literature retrieval) and SciFact (scientific claim verification), domains Q-RAG was **not** trained on. Each dataset contributes 25 queries with 1 positive + 4 hard negatives per query (2 × 25 × 5 = 250 rows).


| Rank | Model | Params | BEIR Acc | Latency (ms) |
|---:|---|---:|---:|---:|
| 1 | bge-small-en-v1.5 | 33M | 93.2% | 38 |
| 2 | ms-marco-MiniLM-L-6-v2 | 23M | 92.4% | 19 |
| 2 | gte-reranker-modernbert-base | 149M | 92.4% | 147 |
| 4 | e5-small-v2 | 33M | 92.0% | 37 |
| 5 | bge-reranker-v2-m3 | 568M | 90.8% | 391 |
| 5 | bge-m3 | 568M | 90.8% | 396 |
| 7 | ms-marco-MiniLM-L-12-v2 | 33M | 90.4% | 38 |
| 7 | bge-reranker-base | 278M | 90.4% | 119 |
| **9** | **Q-RAG-50M-Sovereign** | **50M** | **89.6%** | **168** |
| 9 | mxbai-rerank-xsmall-v1 | 70M | 89.6% | 919 |
| 11 | bge-reranker-large | 560M | 88.4% | 392 |

**Honest reading.** On medical+scientific OOD, Q-RAG **lands rank 9 of 11** at 89.6%. But the field is tight: only **3.6 points** separate the leader (bge-small-en-v1.5 at 93.2%) from Q-RAG, and Q-RAG **outright beats BGE-reranker-large (560M, 11× larger)** by 1.2 points and **ties mxbai-rerank-xsmall-v1**. BGE-reranker-v2-m3 and bge-m3 (both 568M) finish only 1.2 points ahead of us at over 10× the size. Models with 11× our parameters are not 11× better at this task; the curve flattens hard.

That's what "punching above your weight" looks like: a 50M model trading punches with 560M-parameter rerankers on data it wasn't even trained on, while still being **#1 on the data it was trained for**.

## How Q-RAG punches above its weight

Three technical choices, applied together, produce the result above. None are individually novel; the combination is what works at 50M params.

### 1. Cross-domain refusal as a **first-class training objective**, not a side effect

Most retrieval models, embeddings and rerankers alike, are trained on positive ranking signal (MS MARCO click-through, NLI entailment, etc.). They learn what "more relevant" looks like, then hope the threshold separates the relevant from the irrelevant.

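
To make that threshold dependence concrete, here is a minimal sketch of how an oracle threshold (the kind we granted the baselines above) can be picked by sweeping candidate cut-offs. The function name `oracle_threshold` and the toy scores are hypothetical illustrations, not our benchmark script and not real model outputs.

```python
from typing import List, Tuple


def oracle_threshold(scores: List[float], labels: List[int]) -> Tuple[float, float]:
    """Sweep candidate thresholds and return (threshold, accuracy) for the best one.

    A passage is predicted relevant when score >= threshold. Because the
    threshold is chosen using the labels themselves, the resulting accuracy
    is an upper bound on what a deployed threshold could achieve.
    """
    # Every observed score is a candidate; +inf covers "reject everything".
    candidates = sorted(set(scores)) + [float("inf")]
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        correct = sum((s >= t) == bool(y) for s, y in zip(scores, labels))
        acc = correct / len(labels)
        if acc > best_acc:  # strict '>' keeps the lowest threshold on ties
            best_t, best_acc = t, acc
    return best_t, best_acc


# Toy similarity scores (made up for illustration): the 0.78 row is a
# near-miss negative that outranks a true positive at 0.74.
toy_scores = [0.91, 0.78, 0.74, 0.40, 0.12]
toy_labels = [1, 0, 1, 0, 0]

threshold, accuracy = oracle_threshold(toy_scores, toy_labels)
print(threshold, accuracy)  # 0.74 0.8
```

Even with the labels in hand, no single cut-off separates this toy set perfectly, because one negative scores above one positive. That is the situation a score-plus-threshold pipeline leaves the operator to manage.
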

Q-RAG was trained explicitly on **cross-domain off-topic refusal**: every query in the corpus was paired against 5 passages drawn from *other* domains, labeled `0`, and weighted **higher** than the positives during the loss computation. The model learned that the default answer for "wrong domain" is *refuse*, not *score it low and hope the threshold catches it*. The result: 100% on the cross-domain refusal subset, where bge-m3 (568M) drops to 83.3%.

### 2. Adversarial **same-domain near-miss negatives**

The hardest failure for an embedding model is a same-shape-but-wrong-specific-answer passage. "Paris is the capital of France" sits near "Berlin is the capital of Germany" in embedding space: same sentence structure, same topic family, same vocabulary register. The cosine similarity says yes; relevance says no.

For every topic in training, Q-RAG sees 4–6 same-domain wrong-specific-answer passages weighted **even higher** than the positives. The model learned the *shape* of "wrong-but-shaped-right" and refuses cleanly. This is the failure mode that drives most production RAG hallucinations.

### 3. **Binary token output**, not a score

Embedding models output a vector; you compare via cosine and choose a threshold. Rerankers output a logit; you choose a threshold. Both leave the calibration as the operator's problem, and the right threshold depends on the domain, the retriever upstream, and the size of the candidate set.

Q-RAG outputs **a single token**: `1` or `0`. No threshold to tune. No calibration per pipeline. Drop it in after your dense retriever; pass through every passage that scores `1`; refuse if none do. The training objective is binary cross-entropy on that exact token; the inference path is a single argmax on the next-token distribution. No magic.

The result is a small, fast head you put **after** your dense retriever to filter relevant passages before paying token cost on a 7B+ answer model.

## Are we new? Yes, and we trained from a sovereign base

Q-RAG is **53.5M parameters** and was full-fine-tuned from [`tjarvis91/qovaryx-50m-scratch-base`](https://huggingface.co/tjarvis91/qovaryx-50m-scratch-base), a base we pretrained ourselves from random initialization on 491.5M tokens with our own BPE tokenizer (`english_v1`, vocab 32000).

**Not** SmolLM2. **Not** Qwen. **Not** Llama. **Not** Mistral. **Not** Phi. No borrowed foundation model. No closed-source weights. Every parameter traces back to a Qovaryx training run on Qovaryx hardware.

That matters for two reasons:

1. **No license entanglement.** Apache 2.0 all the way down, full audit trail in this repo.
2. **No baked-in priors from someone else's training set.** When we say Q-RAG was trained on cross-domain refusal, we mean it didn't see the BEIR test set or anything contaminated with it during base pretraining either.

## What problem this actually solves

You're already running RAG. Your dense retriever returns top-k passages. Some are relevant. Some are not. You don't want to pay for an LLM call on the not-relevant ones, and you don't want them in the answer model's context wasting attention. Q-RAG is the relevance filter between retrieve and generate.

| Step | What you had | What Q-RAG adds |
|---|---|---|
| 1. Retrieve top-k passages | dense embedding model | (unchanged) |
| **2. Filter for relevance** | (usually skipped) | **Q-RAG: 1 forward pass per passage, output 1 or 0** |
| 3. Generate answer | big LLM with all k passages | big LLM with only the relevant ones |

Pipeline impact:

- **Cheaper**: generation cost only on relevant passages.
- **More accurate**: fewer red-herring passages in the answer model's context.
- **More refusable**: if Q-RAG drops every passage, the system knows to say "I don't have evidence to answer that" instead of hallucinating.

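
The three steps in the table can be sketched as a small wrapper. This is an illustration under stated assumptions, not the Qovaryx runtime: `filter_relevant`, `answer_or_refuse`, `stub_score`, and `stub_generate` are hypothetical names, and the stubs only stand in for a real relevance scorer (any callable returning `1` or `0` for a query/passage pair) and a real answer model.

```python
from typing import Callable, List

# Refusal wording borrowed from the bullet above; use whatever your product says.
REFUSAL = "I don't have evidence to answer that."


def filter_relevant(
    query: str,
    passages: List[str],
    score: Callable[[str, str], int],
) -> List[str]:
    """Step 2: keep only the passages the relevance scorer marks as 1."""
    return [p for p in passages if score(query, p) == 1]


def answer_or_refuse(
    query: str,
    passages: List[str],
    score: Callable[[str, str], int],
    generate: Callable[[str, List[str]], str],
) -> str:
    """Filter the retrieved passages, then generate, or refuse if none survive."""
    kept = filter_relevant(query, passages, score)
    if not kept:
        return REFUSAL  # the answer model is never called
    return generate(query, kept)


# --- Hypothetical stand-ins so the sketch runs without any model ---
def stub_score(query: str, passage: str) -> int:
    # Crude keyword overlap; this is NOT how Q-RAG decides relevance.
    q_words = set(query.lower().split())
    p_words = set(passage.lower().rstrip(".").split())
    return 1 if len(q_words & p_words) >= 2 else 0


def stub_generate(query: str, passages: List[str]) -> str:
    return f"answer from {len(passages)} passage(s)"


top_k = ["Berlin is the capital of Germany.", "The Nile is the longest river."]
print(answer_or_refuse("capital of Germany", top_k, stub_score, stub_generate))
print(answer_or_refuse("how to git commit", top_k, stub_score, stub_generate))
```

The first call keeps one passage and generates; the second keeps none and returns the refusal string without invoking the generator, which is where the cost saving and the refusal behavior both come from.
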

## How to load it (Python)

```python
import torch
from tokenizers import Tokenizer
from bleeding_edge.model.decoder import FinanceDecoder, DecoderConfig

# Load the tokenizer and the checkpoint. weights_only=False unpickles
# arbitrary objects, so only load checkpoints you trust.
tok = Tokenizer.from_file("tokenizer.json")
ckpt = torch.load("pytorch_model.pt", map_location="cpu", weights_only=False)

# Build the config from the checkpoint, dropping keys DecoderConfig does not define.
cfg = DecoderConfig(**{
    k: v for k, v in ckpt["model_cfg"].items()
    if k in DecoderConfig.__dataclass_fields__
})
cfg.vocab_size = tok.get_vocab_size()

# Strip the torch.compile prefix from parameter names before loading.
model = FinanceDecoder(cfg).eval()
state = {k.removeprefix("_orig_mod."): v for k, v in ckpt["model_state"].items()}
model.load_state_dict(state, strict=False)

SYSTEM = (
    "You are Q-Retriever. Given a USER query and a CANDIDATE passage, "
    "decide whether the passage is relevant to the query. "
    "Output exactly one character: 1 if relevant, 0 if not relevant. "
    "Refuse to invent relevance: if the passage does not address the query, output 0."
)


def score(query: str, passage: str) -> int:
    """Return 1 if the model judges the passage relevant to the query, else 0."""
    prompt = f"{SYSTEM}\n\nUSER: Q: {query}\n\nPASSAGE:\n{passage}\n\nASSISTANT: "
    ids = tok.encode(prompt).ids
    cur = torch.tensor([ids], dtype=torch.long)
    with torch.no_grad():
        # Greedy pick of the single next token after "ASSISTANT: ".
        logits = model(cur, return_decision=False).logits[:, -1, :]
        nxt = int(torch.argmax(logits, dim=-1))
    return 1 if tok.decode([nxt]).strip() == "1" else 0


print(score("capital of Germany", "Berlin is the capital of Germany."))  # 1
print(score("capital of Germany", "Paris is the capital of France."))    # 0
print(score("how to git commit", "The Nile is the longest river."))      # 0
```

## Architecture (Qovaryx proprietary FinanceDecoder)

- 53.5M parameters
- 12 decoder blocks, d_model = 512, n_head = 8, GQA n_kv_head = 2
- SwiGLU FFN, RoPE positional, RMSNorm
- Multi-token prediction (MTP) auxiliary heads
- Decision head for routed-decision tasks
- Tokenizer: Qovaryx `english_v1` BPE, vocab 32000 (in-house)
- Pretrained from `qovaryx-50m-scratch-base` step 60000 → 491.5M tokens
- Full fine-tune (no LoRA, no QLoRA, no adapter): every parameter was updated on the Qovaryx Q-RAG crystal corpus

## What this model is NOT

- **Not a sentence embedding model.** No vector output. Use it *after* your dense retriever, not instead.
- **Not a general-purpose chatbot.** Free-text generation outside the relevance-scoring task surface will degrade.
- **Not the top BEIR scorer.** bge-small-en-v1.5 is 3.6 points ahead on BEIR. If your retrieval is exclusively medical/scientific OOD, run that baseline.
- **Not reproducible from this card.** Weights, holdouts, and benchmark numbers are public; the crystal corpus generator and training hyperparameters are not.

## License & posture

Apache 2.0 for the published weights, model card, holdouts, and benchmark JSONs. The Qovaryx scratch base build pipeline, the Q-RAG crystal corpus generator, the eval gate constants, the cluster routing policy, and the protected runtime entrypoint are **Qovaryx proprietary technology** and are not included.

## Reproduction & artifacts in this repo

- `pytorch_model.pt`: Q-RAG weights (v10, 205 MB)
- `tokenizer.json`: Qovaryx english_v1 BPE
- `config.json`: model config
- `holdout_eval.json`: full per-row in-house holdout result (30/30 = 100%)
- `benchmark_vs_embeddings.json`: in-house holdout vs 10 baselines (Q-RAG #1)
- `benchmark_beir.json`: BEIR NFCorpus+SciFact slice vs same baselines
- Reproduction scripts: `scripts/benchmark_q_rag_vs_embeddings.py` and `scripts/benchmark_q_rag_vs_rerankers_beir.py` in the upstream research repo

## Sibling specialists in the Qovaryx Compact Specialist Suite

All ten specialists share the `qovaryx-50m-scratch-base` and the same audit discipline. Use one directly; use all ten through the cluster shell.


- [Q-Triage](https://huggingface.co/tjarvis91/Q-Triage-50M-Sovereign): ticket routing
- [Q-DocCite](https://huggingface.co/tjarvis91/Q-DocCite-50M-Sovereign): document citation
- [Q-Invoice](https://huggingface.co/tjarvis91/Q-Invoice-50M-Sovereign): invoice extraction
- [Q-ToolCall](https://huggingface.co/tjarvis91/Q-ToolCall-50M-Sovereign): agent tool-calls
- [Q-Meeting](https://huggingface.co/tjarvis91/Q-Meeting-50M-Sovereign): meeting structuring
- [Q-FinCite](https://huggingface.co/tjarvis91/Q-FinCite-50M-Sovereign): 10-K/10-Q citation
- [Q-CmdSafe](https://huggingface.co/tjarvis91/Q-CmdSafe-50M-Sovereign): command safety triage
- [Q-SheetExtract](https://huggingface.co/tjarvis91/Q-SheetExtract-50M-Sovereign): spreadsheet extraction
- [Q-Coder](https://huggingface.co/tjarvis91/Q-Coder-50M-Sovereign): Python code skeletons
- **Q-RAG (this model)**: relevance filter for RAG

## Reproduction invitation

If you run Q-RAG against a model not in our table (Cohere Rerank, Voyage Rerank, jina-reranker-v2, ColBERT, or anything else), please open a discussion on this repo with the numbers. We'll add it to the card, honestly, whichever direction the result falls. The benchmark script + holdouts are in this repo.

## Official site & community

The full Qovaryx runtime that orchestrates this specialist alongside the other nine ships from:

- **Site:** https://qovaryx.jehorizon.com
- **Download (desktop beta):** https://qovaryx.jehorizon.com/download.html
- **Research devlog:** https://qovaryx.jehorizon.com/research
- **Community Discord:** https://discord.gg/PtuHZDv5ju
- **Ko-fi (we cover GPU bills):** https://ko-fi.com/tjarvis91
- **Open research repo:** https://github.com/thron-j/qovaryx-ai-research

If you find a failure mode this card doesn't cover, open a discussion or come to the Discord. That's how the next crystal corpus gets written.