---
license: apache-2.0
language:
- en
base_model: tjarvis91/qovaryx-50m-scratch-base
base_model_relation: finetune
library_name: pytorch
pipeline_tag: text-generation
tags:
- text-generation
- rag
- retrieval-augmented-generation
- reranker
- relevance-scoring
- sovereign-base
- on-device
- edge-ai
- qovaryx
- compact-cognition
- local-ai
- cross-domain
- factual-grounding
- anti-hallucination
- 50m
- bpe
- english
- small-language-model
- slm
- beir
---

# Q-RAG-50M-Sovereign — the sovereign retrieval head that punches above its weight

> **A 50M-parameter relevance scorer that beats BGE-reranker-large (560M, 11× larger) on in-distribution refusal and ties or beats 2 of the 10 baselines tested on out-of-distribution BEIR, all on CPU and fully sovereign.**

## What this model does, in one sentence

Given a USER query and a CANDIDATE passage, Q-RAG outputs **exactly one character** — `1` if the passage is relevant to the query, `0` if it is not — making it a drop-in relevance filter for any RAG (retrieval-augmented generation) pipeline.

## Headline: where Q-RAG wins, where it loses, why both matter

### In-distribution (10-domain Q-RAG holdout, 30 rows): **#1 of 11**

Q-RAG was trained on **cross-domain refusal as a first-class objective** — every query paired with both same-domain near-miss adversaries and cross-domain off-topic passages. On the holdout that tests exactly this, **Q-RAG beats every model we evaluated**, including BGE-reranker-large (560M) and BGE-reranker-v2-m3 (568M) — 11× our parameter count.

| Rank | Model | Params | Acc | Carry-12 | Cross-18 |
|---:|---|---:|---:|---:|---:|
| **1** | **Q-RAG-50M-Sovereign** | **50M** | **100.0%** | **100.0%** | **100.0%** |
| 2 | bge-reranker-large | 560M | 96.7% | 100.0% | 94.4% |
| 2 | bge-reranker-v2-m3 | 568M | 96.7% | 100.0% | 94.4% |
| 4 | ms-marco-MiniLM-L-6-v2 | 23M | 93.3% | 100.0% | 88.9% |
| 4 | ms-marco-MiniLM-L-12-v2 | 33M | 93.3% | 100.0% | 88.9% |
| 4 | mxbai-rerank-xsmall-v1 | 70M | 93.3% | 100.0% | 88.9% |
| 4 | gte-reranker-modernbert-base | 149M | 93.3% | 100.0% | 88.9% |
| 8 | e5-small-v2 | 33M | 90.0% | 100.0% | 83.3% |
| 8 | bge-reranker-base | 278M | 90.0% | 100.0% | 83.3% |
| 10 | bge-small-en-v1.5 | 33M | 86.7% | 100.0% | 77.8% |
| 10 | bge-m3 | 568M | 86.7% | 91.7% | 83.3% |

All baselines are at their *oracle* threshold (the threshold chosen to maximize their accuracy on the full holdout — a generous upper bound). Q-RAG outputs `1` or `0` directly with no threshold to tune.
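
To make "oracle threshold" concrete, here is a minimal sketch of how such a threshold could be picked for a score-based baseline. The helper name `oracle_threshold` and the toy scores are illustrative assumptions, not the benchmark script itself.

```python
from typing import Sequence, Tuple

def oracle_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (threshold, accuracy) that maximizes accuracy for "predict 1 if score >= threshold".

    Hypothetical helper: it looks at the gold labels, which is why the
    result is an upper bound rather than a deployable setting.
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and the same length")
    # Candidate thresholds: every observed score, plus one above the max (predict all 0).
    candidates = sorted(set(scores)) + [max(scores) + 1.0]
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        correct = sum(int(s >= t) == y for s, y in zip(scores, labels))
        acc = correct / len(labels)
        if acc > best_acc:  # ties keep the lowest threshold
            best_t, best_acc = t, acc
    return best_t, best_acc

# Toy data, made up for illustration: one irrelevant passage outscores a relevant one.
toy_scores = [0.91, 0.80, 0.42, 0.55, 0.10]
toy_labels = [1, 1, 1, 0, 0]
threshold, accuracy = oracle_threshold(toy_scores, toy_labels)
```

Even with the best possible threshold, the toy baseline cannot classify every row correctly, because no single cut separates the overlapping scores.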

### Out-of-distribution (BEIR NFCorpus + SciFact slice, 250 rows): **rank 9 of 11**, but the gap is tiny

We also tested on **BEIR**, a public IR benchmark. The slice combines NFCorpus (medical literature retrieval) and SciFact (scientific claim verification), domains Q-RAG was **not** trained on. Each dataset contributes 25 queries, with 1 positive and 4 hard negatives per query (2 × 25 × 5 = 250 rows).

| Rank | Model | Params | BEIR Acc | Lat (ms) |
|---:|---|---:|---:|---:|
| 1 | bge-small-en-v1.5 | 33M | 93.2% | 38 |
| 2 | ms-marco-MiniLM-L-6-v2 | 23M | 92.4% | 19 |
| 2 | gte-reranker-modernbert-base | 149M | 92.4% | 147 |
| 4 | e5-small-v2 | 33M | 92.0% | 37 |
| 5 | bge-reranker-v2-m3 | 568M | 90.8% | 391 |
| 5 | bge-m3 | 568M | 90.8% | 396 |
| 7 | ms-marco-MiniLM-L-12-v2 | 33M | 90.4% | 38 |
| 7 | bge-reranker-base | 278M | 90.4% | 119 |
| **9** | **Q-RAG-50M-Sovereign** | **50M** | **89.6%** | **168** |
| 9 | mxbai-rerank-xsmall-v1 | 70M | 89.6% | 919 |
| 11 | bge-reranker-large | 560M | 88.4% | 392 |

**Honest reading.** On medical+scientific OOD, Q-RAG **lands rank 9 of 11** at 89.6%. But the field is tight: only **3.6 points** separate the leader (bge-small-en-v1.5 at 93.2%) from Q-RAG, and Q-RAG **outright beats BGE-reranker-large (560M, 11× larger)** by 1.2 points and **ties mxbai-rerank-xsmall-v1**. Models like BGE-reranker-v2-m3 and bge-m3 (568M) finish only 1.2 points ahead of us at over 10× the size.

Models with 11× our parameters are not 11× better at this task — the curve flattens hard. That's what "punching above your weight" looks like: a 50M model trading punches with 560M-parameter rerankers on data it wasn't even trained on, while still being **#1 on the data it was trained for**.
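
For scale, the BEIR percentages convert to row counts as follows. This is plain arithmetic on the table figures; the variable names are illustrative.

```python
# 2 datasets x 25 queries x (1 positive + 4 hard negatives)
ROWS = 2 * 25 * (1 + 4)

# Accuracies copied from the BEIR table (percent).
beir_acc = {
    "bge-small-en-v1.5": 93.2,
    "bge-reranker-v2-m3": 90.8,
    "Q-RAG-50M-Sovereign": 89.6,
    "bge-reranker-large": 88.4,
}

# Convert each accuracy into a count of correctly classified rows.
correct = {name: round(acc / 100 * ROWS) for name, acc in beir_acc.items()}

gap_to_leader = correct["bge-small-en-v1.5"] - correct["Q-RAG-50M-Sovereign"]
lead_over_large = correct["Q-RAG-50M-Sovereign"] - correct["bge-reranker-large"]
```

On a 250-row slice, the 3.6-point gap to the leader is 9 rows and the 1.2-point lead over bge-reranker-large is 3 rows.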

## How Q-RAG punches above its weight

Three technical choices, applied together, produce the result above. None are individually novel; the combination is what works at 50M params.

### 1. Cross-domain refusal as a **first-class training objective**, not a side effect

Most retrieval models — embeddings and rerankers alike — are trained on positive ranking signal (MS MARCO click-through, NLI entailment, etc.). They learn what "more relevant" looks like, then hope the threshold separates the relevant from the irrelevant.

Q-RAG was trained explicitly on **cross-domain off-topic refusal** — every query in the corpus was paired against 5 passages drawn from *other* domains, labeled `0`, and weighted **higher** than the positives during the loss computation. The model learned that the default answer for "wrong domain" is *refuse*, not *score it low and hope the threshold catches it*. The result: 100% on the cross-domain refusal subset, where bge-m3 (568M) drops to 83.3%.
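
A minimal sketch of how cross-domain negatives of this kind could be assembled. The crystal corpus generator is not public, so the function name, the data layout, and the uniform sampling here are assumptions for illustration only.

```python
import random
from typing import Dict, List, Tuple

def cross_domain_negatives(
    query: str,
    query_domain: str,
    passages_by_domain: Dict[str, List[str]],
    k: int = 5,
    seed: int = 0,
) -> List[Tuple[str, str, int]]:
    """Pair `query` with k passages sampled from domains other than its own, each labeled 0."""
    rng = random.Random(seed)
    # Pool every passage whose domain differs from the query's domain.
    pool = [
        passage
        for domain, passages in passages_by_domain.items()
        if domain != query_domain
        for passage in passages
    ]
    if len(pool) < k:
        raise ValueError("not enough off-domain passages to sample from")
    return [(query, passage, 0) for passage in rng.sample(pool, k)]

# Toy corpus, made up for illustration.
corpus = {
    "geography": ["Berlin is the capital of Germany."],
    "git": ["Run git commit -m to record staged changes.", "git push uploads commits."],
    "rivers": ["The Nile is the longest river.", "The Amazon carries the most water."],
    "cooking": ["Simmer the stock for two hours.", "Rest the dough overnight."],
}
pairs = cross_domain_negatives("capital of Germany", "geography", corpus)
```

Each returned triple is `(query, passage, label)` with label `0`, matching the refusal target described above.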

### 2. Adversarial **same-domain near-miss negatives**

The hardest failure for an embedding model is a same-shape-but-wrong-specific-answer passage. "Paris is the capital of France" sits near "Berlin is the capital of Germany" in embedding space — same sentence structure, same topic family, same vocabulary register. The cosine similarity says yes; relevance says no.

For every topic in training, Q-RAG saw 4–6 same-domain wrong-specific-answer passages, weighted **even higher** than the positives. The model learned the *shape* of "wrong-but-shaped-right" and refuses cleanly. This is the failure mode that drives most production RAG hallucinations.
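
The weighting described in choices 1 and 2 can be sketched as a per-example weighted binary cross-entropy. The training hyperparameters are not published, so the weight values below are placeholders. They only preserve the stated ordering (both negative types above the positives), and placing near-miss negatives above cross-domain negatives is our reading of "even higher", which is an assumption.

```python
import math
from typing import Sequence

# Placeholder weights: the real values are not published.
WEIGHTS = {"positive": 1.0, "cross_domain_neg": 1.5, "near_miss_neg": 2.0}

def weighted_bce(p_relevant: Sequence[float], labels: Sequence[int], kinds: Sequence[str]) -> float:
    """Weighted mean binary cross-entropy over a batch.

    p_relevant[i] is the model's probability of emitting `1` for example i,
    labels[i] is the gold label, and kinds[i] selects the example weight.
    """
    total, weight_sum = 0.0, 0.0
    for p, y, kind in zip(p_relevant, labels, kinds):
        p = min(max(p, 1e-7), 1.0 - 1e-7)  # clamp to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        w = WEIGHTS[kind]
        total += w * loss
        weight_sum += w
    return total / weight_sum

# Same predictions, different negative type: a confident positive plus a
# negative the model wrongly leans "relevant" on (p = 0.6).
near_miss_batch = weighted_bce([0.9, 0.6], [1, 0], ["positive", "near_miss_neg"])
cross_domain_batch = weighted_bce([0.9, 0.6], [1, 0], ["positive", "cross_domain_neg"])
```

With these placeholder weights, the same mistake contributes more to the batch loss when it lands on a near-miss negative than on a cross-domain one.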

### 3. **Binary token output**, not a score

Embedding models output a vector; you compare via cosine and choose a threshold. Rerankers output a logit; you choose a threshold. Both leave the calibration as the operator's problem — and the right threshold depends on the domain, the retriever upstream, and the size of the candidate set.

Q-RAG outputs **a single token**: `1` or `0`. No threshold to tune. No calibration per pipeline. Drop it in after your dense retriever; pass through every passage that scores `1`; refuse if none do. The training objective is binary cross-entropy on that exact token; the inference path is a single argmax on the next-token distribution. No magic.

The result is a small, fast head you put **after** your dense retriever to filter relevant passages before paying token cost on a 7B+ answer model.

## Are we new? Yes — and we trained from a sovereign base

Q-RAG has **53.5M parameters** and was fully fine-tuned from [`tjarvis91/qovaryx-50m-scratch-base`](https://huggingface.co/tjarvis91/qovaryx-50m-scratch-base), a base we pretrained ourselves from random initialization on 491.5M tokens with our own BPE tokenizer (`english_v1`, vocab 32000).

**Not** SmolLM2. **Not** Qwen. **Not** Llama. **Not** Mistral. **Not** Phi. No borrowed foundation model. No closed-source weights. Every parameter traces back to a Qovaryx training run on Qovaryx hardware.

That matters for two reasons:
1. **No license entanglement** — Apache 2.0 all the way down, full audit trail in this repo.
2. **No baked-in priors from someone else's training set** — when we say Q-RAG was trained on cross-domain refusal, we mean it didn't see the BEIR test set or anything contaminated with it during base pretraining either.

## What problem this actually solves

You're already running RAG. Your dense retriever returns top-k passages. Some are relevant. Some are not. You don't want to pay for an LLM call on the not-relevant ones, and you don't want them in the answer model's context wasting attention. Q-RAG is the relevance filter between retrieve and generate.

| Step | What you had | What Q-RAG adds |
|---|---|---|
| 1. Retrieve top-k passages | dense embedding model | (unchanged) |
| **2. Filter for relevance** | — usually skipped | **Q-RAG: 1 forward pass per passage, output 1 or 0** |
| 3. Generate answer | big LLM with all k passages | big LLM with only the relevant ones |

Pipeline impact:
- **Cheaper** — generation cost only on relevant passages.
- **More accurate** — fewer red-herring passages in the answer model's context.
- **More refusable** — if Q-RAG drops every passage, the system knows to say "I don't have evidence to answer that" instead of hallucinating.
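
The retrieve, filter, generate flow above can be sketched as follows. `toy_score` and `toy_generate` are hypothetical stand-ins so the example runs on its own; in a real pipeline the scorer would be a Q-RAG call and the generator would be your answer model.

```python
from typing import Callable, List

REFUSAL = "I don't have evidence to answer that."

def filter_passages(query: str, passages: List[str], score_fn: Callable[[str, str], int]) -> List[str]:
    """Keep only the passages the relevance scorer labels 1."""
    return [p for p in passages if score_fn(query, p) == 1]

def answer(
    query: str,
    passages: List[str],
    score_fn: Callable[[str, str], int],
    generate_fn: Callable[[str, List[str]], str],
) -> str:
    """Generate from relevant passages only; refuse if nothing survives the filter."""
    kept = filter_passages(query, passages, score_fn)
    if not kept:
        return REFUSAL
    return generate_fn(query, kept)

def toy_score(query: str, passage: str) -> int:
    # Stand-in scorer (NOT Q-RAG): relevant only if every query word appears in the passage.
    words = set(passage.lower().replace(".", "").split())
    return 1 if set(query.lower().split()) <= words else 0

def toy_generate(query: str, kept: List[str]) -> str:
    # Stand-in for the answer model.
    return f"answer from {len(kept)} passage(s)"

retrieved = ["Berlin is the capital of Germany.", "Paris is the capital of France."]
grounded = answer("capital of Germany", retrieved, toy_score, toy_generate)
refused = answer("how to git commit", ["The Nile is the longest river."], toy_score, toy_generate)
```

The generator is never called when the filter returns an empty list, which is where both the cost saving and the refusal behavior come from.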

## How to load it (Python)

```python
import torch
from tokenizers import Tokenizer
from bleeding_edge.model.decoder import FinanceDecoder, DecoderConfig

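# Load the BPE tokenizer shipped with this repo.
tok = Tokenizer.from_file("tokenizer.json")
# weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust.
ckpt = torch.load("pytorch_model.pt", map_location="cpu", weights_only=False)
# Keep only the checkpoint config keys that DecoderConfig actually defines.
cfg = DecoderConfig(**{k: v for k, v in ckpt["model_cfg"].items() if k in DecoderConfig.__dataclass_fields__})
cfg.vocab_size = tok.get_vocab_size()
model = FinanceDecoder(cfg).eval()
# Strip the "_orig_mod." prefix that torch.compile adds to parameter names (str.removeprefix needs Python 3.9+).
state = {k.removeprefix("_orig_mod."): v for k, v in ckpt["model_state"].items()}
# strict=False tolerates missing or unexpected keys; inspect the return value if results look off.
model.load_state_dict(state, strict=False)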

SYSTEM = (
    "You are Q-Retriever. Given a USER query and a CANDIDATE passage, "
    "decide whether the passage is relevant to the query. "
    "Output exactly one character: 1 if relevant, 0 if not relevant. "
    "Refuse to invent relevance: if the passage does not address the query, output 0."
)

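def score(query: str, passage: str) -> int:
    """Return 1 if the model's next token after the prompt is "1", else 0."""
    prompt = f"{SYSTEM}\n\nUSER: Q: {query}\n\nPASSAGE:\n{passage}\n\nASSISTANT: "
    ids = tok.encode(prompt).ids
    cur = torch.tensor([ids], dtype=torch.long)  # shape (1, seq_len)
    with torch.no_grad():
        logits = model(cur, return_decision=False).logits[:, -1, :]  # next-token logits
        nxt = int(torch.argmax(logits, dim=-1))
    # Any token other than "1" is treated as "not relevant".
    return 1 if tok.decode([nxt]).strip() == "1" else 0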

print(score("capital of Germany", "Berlin is the capital of Germany."))  # 1
print(score("capital of Germany", "Paris is the capital of France."))     # 0
print(score("how to git commit", "The Nile is the longest river."))       # 0
```

## Architecture (Qovaryx proprietary FinanceDecoder)

- 53.5M parameters
- 12 decoder blocks, d_model = 512, n_head = 8, GQA n_kv_head = 2
- SwiGLU FFN, RoPE positional encoding, RMSNorm
- Multi-token prediction (MTP) auxiliary heads
- Decision head for routed-decision tasks
- Tokenizer: Qovaryx `english_v1` BPE, vocab 32000 (in-house)
- Base: `qovaryx-50m-scratch-base` at step 60000 (491.5M pretraining tokens)
- Full fine-tune (no LoRA, no QLoRA, no adapter): every parameter was updated on the Qovaryx Q-RAG crystal corpus
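
As a rough illustration of what the GQA setting in the list above implies, the sketch below derives the attention dimensions from the listed values. It assumes heads split `d_model` evenly (the standard convention); the card does not state the head dimension explicitly, and the variable names are ours.

```python
# Values taken from the architecture list.
D_MODEL, N_HEAD, N_KV_HEAD, N_LAYERS = 512, 8, 2, 12

# Assumption: heads evenly split d_model, as in standard multi-head attention.
head_dim = D_MODEL // N_HEAD            # width of one attention head
queries_per_kv = N_HEAD // N_KV_HEAD    # query heads sharing each K/V head
kv_width = N_KV_HEAD * head_dim         # output width of the K (or V) projection

# Values cached per token: K and V, for every layer.
kv_cache_per_token_gqa = 2 * kv_width * N_LAYERS
kv_cache_per_token_mha = 2 * D_MODEL * N_LAYERS  # if every query head had its own K/V
```

Under that assumption, grouped-query attention with 2 KV heads caches a quarter of the keys and values that full multi-head attention would at the same width.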

## What this model is NOT

- **Not a sentence embedding model.** No vector output. Use it *after* your dense retriever, not instead.
- **Not a general-purpose chatbot.** Free-text generation outside the relevance-scoring task surface will degrade.
- **Not the top BEIR scorer** — bge-small-en-v1.5 is 3.6 points ahead on BEIR. If your retrieval is exclusively medical/scientific OOD, run that baseline.
- **Not reproducible from this card.** Weights, holdouts, and benchmark numbers are public; the crystal corpus generator and training hyperparameters are not.

## License & posture

Apache 2.0 for the published weights, model card, holdouts, and benchmark JSONs.

The Qovaryx scratch base build pipeline, the Q-RAG crystal corpus generator, the eval gate constants, the cluster routing policy, and the protected runtime entrypoint are **Qovaryx proprietary technology** and are not included.

## Reproduction & artifacts in this repo

- `pytorch_model.pt` — Q-RAG weights (v10, 205 MB)
- `tokenizer.json` — Qovaryx english_v1 BPE
- `config.json` — model config
- `holdout_eval.json` — full per-row in-house holdout result (30/30 = 100%)
- `benchmark_vs_embeddings.json` — in-house holdout vs 10 baselines (Q-RAG #1)
- `benchmark_beir.json` — BEIR NFCorpus+SciFact slice vs same baselines
- Reproduction scripts: `scripts/benchmark_q_rag_vs_embeddings.py` and `scripts/benchmark_q_rag_vs_rerankers_beir.py` in the upstream research repo

## Sibling specialists in the Qovaryx Compact Specialist Suite

All ten specialists share the `qovaryx-50m-scratch-base` and the same audit discipline. Use one directly; use all ten through the cluster shell.

- [Q-Triage](https://huggingface.co/tjarvis91/Q-Triage-50M-Sovereign) — ticket routing
- [Q-DocCite](https://huggingface.co/tjarvis91/Q-DocCite-50M-Sovereign) — document citation
- [Q-Invoice](https://huggingface.co/tjarvis91/Q-Invoice-50M-Sovereign) — invoice extraction
- [Q-ToolCall](https://huggingface.co/tjarvis91/Q-ToolCall-50M-Sovereign) — agent tool-calls
- [Q-Meeting](https://huggingface.co/tjarvis91/Q-Meeting-50M-Sovereign) — meeting structuring
- [Q-FinCite](https://huggingface.co/tjarvis91/Q-FinCite-50M-Sovereign) — 10-K/10-Q citation
- [Q-CmdSafe](https://huggingface.co/tjarvis91/Q-CmdSafe-50M-Sovereign) — command safety triage
- [Q-SheetExtract](https://huggingface.co/tjarvis91/Q-SheetExtract-50M-Sovereign) — spreadsheet extraction
- [Q-Coder](https://huggingface.co/tjarvis91/Q-Coder-50M-Sovereign) — Python code skeletons
- **Q-RAG (this model)** — relevance filter for RAG

## Reproduction invitation

If you run Q-RAG against a model not in our table (Cohere Rerank, Voyage Rerank, jina-reranker-v2, ColBERT, or anything else), please open a discussion on this repo with the numbers. We'll add it to the card, honestly, whichever direction the result falls. The holdouts are in this repo; the benchmark scripts are in the upstream research repo.

## Official site & community

The full Qovaryx runtime that orchestrates this specialist alongside the other nine ships from:

- **Site:** https://qovaryx.jehorizon.com
- **Download (desktop beta):** https://qovaryx.jehorizon.com/download.html
- **Research devlog:** https://qovaryx.jehorizon.com/research
- **Community Discord:** https://discord.gg/PtuHZDv5ju
- **Ko-fi (we cover GPU bills):** https://ko-fi.com/tjarvis91
- **Open research repo:** https://github.com/thron-j/qovaryx-ai-research

If you find a failure mode this card doesn't cover, open a discussion or come to the Discord — that's how the next crystal corpus gets written.