---
license: cc-by-nc-4.0
language:
- en
tags:
- embeddings
- dense-retrieval
- matryoshka
- rag
- agents
- mteb
- sentence-similarity
- semantic-search
- text-embeddings
- text-embedding
- vector-search
- document-retrieval
- similarity-search
- classification
- clustering
- edge-ai
- on-device
- local-inference
- efficient-ai
- rag-retrieval
library_name: ogma
metrics:
- mteb
model-index:
- name: axiotic/ogma-large
  results:
  - task:
      type: sts
    dataset:
      name: MTEB STSBenchmark
      type: mteb/stsbenchmark-sts
      split: test
      revision: b0fddb56ed78048fa8b90373c8a3cfc37b684831
    metrics:
    - type: cosine_spearman
      value: 83.7
  - task:
      type: classification
    dataset:
      name: MTEB AmazonPolarityClassification
      type: mteb/amazon_polarity
      split: test
    metrics:
    - type: accuracy
      value: 68.6
  - task:
      type: clustering
    dataset:
      name: MTEB RedditClustering
      type: mteb/reddit-clustering
      split: test
    metrics:
    - type: v_measure
      value: 41.6
  - task:
      type: pair-classification
    dataset:
      name: MTEB TwitterSemEval2015
      type: mteb/twittersemeval2015-pairclassification
      split: test
    metrics:
    - type: cos_sim_ap
      value: 84.0
  - task:
      type: reranking
    dataset:
      name: MTEB MindSmallReranking
      type: mteb/mind_small
      split: validation
    metrics:
    - type: map
      value: 53.1
  - task:
      type: retrieval
    dataset:
      name: MTEB MSMARCO
      type: mteb/msmarco
      split: dev
    metrics:
    - type: ndcg_at_10
      value: 43.7
  - task:
      type: summarization
    dataset:
      name: MTEB SummEval
      type: mteb/summeval
      split: test
    metrics:
    - type: cos_sim_spearman
      value: 30.9
pipeline_tag: sentence-similarity
---

# ogma-large &nbsp;·&nbsp; 32.4M text embedding model &nbsp;·&nbsp; MTEB 57.41

> English text embedding model for semantic search, RAG, vector search, retrieval, clustering, classification, STS, and agent memory — MTEB 57.41, 32.4M parameters, 1024-token context

**Ogma Large** is the highest-quality model in the Ogma family, built for applications where accuracy matters most. At 32.4M parameters it competes directly with Potion-32M while delivering 4× longer context (1024 vs 256 tokens), asymmetric query/document encoding, and Matryoshka embeddings for flexible dimensionality.

## Why the name Ogma?

Ogma is named after **Ogma** (also written Oghma), the Irish god associated with eloquence and credited in myth with inventing **Ogham**, an early alphabet for encoding language into symbols. That is the core job of an embedding model: turn language into compact vectors that machines can search, compare, cluster, and reason over.

---

## Use cases

ogma-large is designed for **semantic search**, **RAG retrieval**, **agent memory**, **vector databases**, **document retrieval**, **text classification**, **clustering**, **STS / sentence similarity**, and retrieval-heavy pipelines where a wider Ogma model is useful. For most production RAG and agent workloads, start with **ogma-base** or **ogma-small** for better efficiency.

Good fits:

- **Quality-first retrieval** where the extra model width is worth the larger footprint.
- **Server-side embedding services** for high-quality document search, knowledge-base retrieval, and agent memory.
- **Evaluation and ablation work** where the strongest Ogma checkpoint is useful as a reference model.
- **Private/local deployments** where data should stay inside the application or organisation, but on-device footprint is less important.
- **Classification and clustering pipelines** where representation quality matters more than minimum latency.

Choose ogma-large when accuracy matters more than edge footprint. Choose **ogma-small**, **ogma-mini**, or **ogma-micro** for on-device, browser, serverless, or ultra-efficient applications.

---

## Highlights

- 🏆 **MTEB avg 57.41** on the standard 54-task English benchmark
- 📏 **1024-token context** — 4× longer than all-MiniLM-L6-v2 (256 tokens)
- 🔀 **Asymmetric encoding** via task tokens: `[QRY]`, `[DOC]`, `[SYM]`
- 📐 **Matryoshka dims**: [32, 64, 128, 256], so one model serves four embedding sizes
- 🛡️ **+4.0% F1** on prompt injection detection vs MiniLM (same architecture series)

---

## Performance

### MTEB English — 54/54 tasks evaluated (category-averaged)

Benchmarked with [MTEB](https://github.com/embeddings-benchmark/mteb) v2.10.7 on the standard 54-task English benchmark using category averaging (same methodology as the MTEB leaderboard).

| Category | ogma-large | all-MiniLM-L6-v2 | Δ vs MiniLM |
|---|---|---|---|
| Classification | **68.6** | 62.62 | +5.98 |
| Clustering | **41.6** | 41.94 | -0.34 |
| PairClassification | **84.0** | 82.37 | +1.63 |
| Reranking | **53.1** | 58.04 | -4.94 |
| Retrieval | **43.7** | 41.95 | +1.75 |
| STS | **83.7** | 78.90 | +4.80 |
| Summarization | **30.9** | 30.81 | +0.09 |
| **Overall** | **57.41** | *56.09* | **+1.32** |

> MiniLM and Potion reference scores from the [Model2Vec results page](https://github.com/MinishLab/model2vec/blob/main/results/README.md).

### Why choose Ogma Large?

ogma-large delivers the highest raw quality in the family with its wider 512-dim internal representation. For most RAG and agent workloads, **ogma-base offers better efficiency** at a similar MTEB score (57.04 vs 57.41) with 13.3M parameters instead of 32.4M.

### CPU Inference Benchmark

Benchmarked on AMD Ryzen Threadripper PRO 3955WX (16-core/32-thread), PyTorch 2.10, batch of 100 mixed-length documents.

| Model | Params | 1T·bs1 (docs/s) | 1T·bs1 latency | 1T·bs32 (docs/s) | 16T·bs32 (docs/s) |
|---|---|---|---|---|---|
| potion-base-8M | 7.6M | 6,892 | 0.14 ms | 18,021 | 17,040 |
| potion-base-32M | 32.3M | 6,826 | 0.15 ms | 17,984 | 17,328 |
| **ogma-small** | **8.6M** | **92.9** | **10.8 ms** | **60.9** | **255.6** |
| all-MiniLM-L6-v2 | 22.7M | 53.1 | 18.8 ms | 40.5 | 227.9 |
| **ogma-base** | **13.3M** | **48.3** | **20.7 ms** | **28.9** | **121.6** |
| bge-small-en-v1.5 | 33.4M | 26.8 | 37.3 ms | 19.8 | 115.3 |
| ogma-large | 32.4M | 22.1 | 45.2 ms | 12.5 | 67.3 |
| bge-base-en-v1.5 | 109.5M | 7.6 | 131.7 ms | 4.8 | 30.2 |

> Potion models are static (lookup-based); their near-zero inference cost comes at the price of no contextual understanding and a fixed 256-token context. Transformer models like Ogma and MiniLM understand context. **ogma-small is 1.75× faster** than MiniLM at one thread and batch size 1 (92.9 vs 53.1 docs/s) and **1.12× faster** at 16 threads and batch size 32 (255.6 vs 227.9 docs/s).
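
The latency and speed-up figures follow directly from the docs/s columns. The sketch below recomputes a few of them; the helper names are illustrative and not part of any Ogma tooling.

```python
def latency_ms(docs_per_second: float) -> float:
    """Per-document latency implied by a throughput figure."""
    return 1000.0 / docs_per_second


def speedup(candidate: float, baseline: float) -> float:
    """How many times faster `candidate` is than `baseline` (both in docs/s)."""
    return candidate / baseline


# Throughput figures copied from the table above (docs/s)
ogma_small = {"1T_bs1": 92.9, "16T_bs32": 255.6}
minilm = {"1T_bs1": 53.1, "16T_bs32": 227.9}
ogma_large_1t_bs1 = 22.1

print(round(latency_ms(ogma_large_1t_bs1), 1))                            # 45.2
print(round(speedup(ogma_small["1T_bs1"], minilm["1T_bs1"]), 2))          # 1.75
print(round(speedup(ogma_small["16T_bs32"], minilm["16T_bs32"]), 2))      # 1.12
```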

### Safety — Toxicity & Prompt Injection Detection

Evaluated on the Ogma transformer architecture (same family). Embeddings are extracted then fed to a logistic regression (LR) or MLP classifier head — the embedding model itself is not fine-tuned. Evaluated against `all-MiniLM-L6-v2` as baseline.
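
To illustrate the frozen-embedding setup, here is a minimal sketch of a logistic-regression head trained on fixed vectors. The toy 2-D vectors stand in for Ogma embeddings, and the training loop and hyperparameters are assumptions for illustration; the classifier implementation used in the evaluation is not specified in this card.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def score(w, b, x):
    """Linear score w.x + b for one embedding."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_logreg(features, labels, lr=0.5, epochs=500):
    """Batch gradient descent on the logistic loss; the embeddings stay fixed."""
    w, b = [0.0] * len(features[0]), 0.0
    n = len(features)
    for _ in range(epochs):
        errors = [sigmoid(score(w, b, x)) - y for x, y in zip(features, labels)]
        w = [wi - lr * sum(e * x[i] for e, x in zip(errors, features)) / n
             for i, wi in enumerate(w)]
        b -= lr * sum(errors) / n
    return w, b


def predict(w, b, x) -> int:
    return 1 if score(w, b, x) >= 0.0 else 0


# Hypothetical frozen embeddings (unit vectors) with binary labels
train_x = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-0.6, 0.8]]
train_y = [1, 1, 0, 0]

w, b = train_logreg(train_x, train_y)
print([predict(w, b, x) for x in train_x])
```

Only `w` and `b` are learned here; the vectors in `train_x` are never updated, which mirrors the "embedding model itself is not fine-tuned" condition.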

#### 1. Jigsaw Toxic Comment Classification

**Dataset:** `Arsive/toxicity_classification_jigsaw` — Binary toxicity classification
**Train:** 25,960 · **Test:** 6,490

| Model | Classifier | Accuracy | F1 | Precision | Recall | AUC-ROC |
|---|---|---|---|---|---|---|
| **Ogma** | LogReg | 89.12% | 88.26% | 89.09% | 87.44% | 95.74% |
| **Ogma** | MLP | 88.91% | 87.98% | 89.14% | 86.85% | 95.92% |
| MiniLM | LogReg | 87.32% | 86.25% | 87.46% | 85.07% | 94.96% |
| MiniLM | MLP | 91.71% | **91.24%** | 90.13% | 92.39% | **97.16%** |

Ogma (LR) leads MiniLM (LR) by **+2.01 F1 points**, but MiniLM (MLP) has the best result on this dataset. With roughly 25K training samples, the MLP head appears able to compensate for MiniLM's slightly weaker base representations.

#### 2. Prompt Injection Detection — deepset/prompt-injections

**Dataset:** `deepset/prompt-injections` — Binary injection detection
**Train:** 546 · **Test:** 116 (low-data regime)

| Model | Classifier | Accuracy | F1 | Precision | Recall | AUC-ROC |
|---|---|---|---|---|---|---|
| **Ogma** | LogReg | 86.21% | 84.62% | **100.0%** | 73.33% | 97.77% |
| **Ogma** | MLP | **90.52%** | **90.27%** | 96.23% | 85.0% | **98.1%** |
| MiniLM | LogReg | 82.76% | 80.39% | 97.62% | 68.33% | 94.52% |
| MiniLM | MLP | 87.07% | 86.24% | 95.92% | 78.33% | 93.96% |

Ogma leads across both classifiers: **+4.03% F1 (MLP)**, **+4.23% F1 (LogReg)**. Ogma's representations are better separated in the low-data regime — it achieves 100% precision with LogReg, meaning zero false positives.
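
As a worked check of how these metrics relate, the sketch below computes precision, recall, F1, and accuracy from confusion counts. The counts are not taken from the evaluation logs; they are one set of values reconstructed to be consistent with the reported Ogma LogReg row and the 116-example test set, and should be read as an assumption.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary classification metrics from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Reconstructed (assumed) counts: 60 injections, 56 benign prompts, no false positives
metrics = binary_metrics(tp=44, fp=0, fn=16, tn=56)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.2f}%")
```

With zero false positives, precision is 100% regardless of how many injections are missed, which is why the LogReg row pairs perfect precision with 73.33% recall.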

#### 3. Prompt Injection Detection — neuralchemy/Prompt-injection-dataset

**Dataset:** `neuralchemy/Prompt-injection-dataset` — Binary injection detection
**Train:** 4,391 · **Test:** 942

| Model | Classifier | Accuracy | F1 | Precision | Recall | AUC-ROC |
|---|---|---|---|---|---|---|
| **Ogma** | LogReg | 95.22% | 95.93% | 95.84% | 96.01% | 99.30% |
| **Ogma** | MLP | **95.44%** | **96.16%** | 94.89% | **97.46%** | **99.37%** |
| MiniLM | LogReg | 94.59% | 95.38% | 95.46% | 95.29% | 98.92% |
| MiniLM | MLP | 93.95% | 94.85% | 94.59% | 95.11% | 98.92% |

Ogma leads on every metric when the same classifier head is compared: **+1.31 F1 points (MLP)** and **+0.55 F1 points (LR)**. Best against best, the gap is +0.78 (96.16% vs 95.38%). Both models perform well at this scale, and Ogma achieves higher AUC-ROC (99.37% vs 98.92%).

#### Summary

| Task | Ogma best F1 | MiniLM best F1 | Δ |
|---|---|---|---|
| Jigsaw Toxicity | 88.26% (LR) | 91.24% (MLP) | −2.98% |
| deepset Injection | **90.27% (MLP)** | 86.24% (MLP) | **+4.03%** |
| neuralchemy Injection | **96.16% (MLP)** | 95.38% (LR) | **+0.78%** |

Ogma is a stronger feature extractor for **prompt injection detection** — the safety-critical task for agent pipelines. MiniLM edges ahead on toxicity when given sufficient labelled data and a more powerful classifier head. For agentic use cases where detecting adversarial instructions is the priority, Ogma representations are the better choice.

---

## Architecture

| Property | Value |
|---|---|
| Architecture | Custom Transformer |
| Internal dim (`d_model`) | 512 |
| Output dim (`d_output`) | 256 |
| Transformer layers | 9 |
| Attention heads | 8 |
| Vocabulary | 30,000 (SentencePiece / AlbertTokenizer) |
| Max sequence length | **1,024 tokens** |
| Pooling | Mean pooling |
| Task tokens | `[QRY]` (query), `[DOC]` (document), `[SYM]` (symmetric) |
| Matryoshka dims | [32, 64, 128, 256] |
| Output normalisation | L2 (unit sphere) |
| Parameters | 32.4M |
| Model file | `model.safetensors` (124 MB) |

**Key design choices:**

- **Task token prepend:** A learnable task token (`[QRY]`, `[DOC]`, or `[SYM]`) is prepended to the input sequence before the transformer. This enables true asymmetric encoding in a single model with a single forward pass.
- **Matryoshka training:** The model is trained with Matryoshka Representation Learning, meaning embeddings truncated to any supported sub-dimension remain well-calibrated without retraining.
- **Mean pooling:** The average of all token outputs (excluding padding) produces the sentence embedding, which consistently outperforms CLS-token pooling in the Ogma architecture family.
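- **L2 normalisation:** All outputs are unit-normalised, so cosine similarity equals the dot product, and squared Euclidean distance is a monotone function of it (‖a − b‖² = 2 − 2·a·b). All three therefore produce the same ranking, which simplifies downstream usage.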
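
The pooling and normalisation steps above can be sketched in plain Python. This is a toy illustration with 2-D vectors and made-up values, not the model's internal code; it shows masked mean pooling followed by L2 normalisation, and checks the unit-vector identity between dot product and squared Euclidean distance.

```python
import math


def masked_mean_pool(token_vecs, mask):
    """Average token vectors where mask == 1 (padding positions are skipped)."""
    kept = [v for v, m in zip(token_vecs, mask) if m == 1]
    dim = len(token_vecs[0])
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sq_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy token outputs: sequence A has 3 real tokens and 1 padding position
tokens_a, mask_a = [[1.0, 0.0], [3.0, 4.0], [2.0, 2.0], [9.0, 9.0]], [1, 1, 1, 0]
tokens_b, mask_b = [[0.0, 1.0], [0.0, 3.0]], [1, 1]

a = l2_normalize(masked_mean_pool(tokens_a, mask_a))
b = l2_normalize(masked_mean_pool(tokens_b, mask_b))

cosine = dot(a, b)  # equals cosine similarity because both vectors have unit length
print(cosine, sq_euclidean(a, b), 2 - 2 * cosine)
```

The last two printed values agree, so ranking by highest dot product and ranking by smallest Euclidean distance give the same order.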

---

## Usage

### Installation

```bash
pip install torch tokenizers huggingface_hub pyyaml
```

### Basic Encoding

```python
from huggingface_hub import snapshot_download
from tokenizers import Tokenizer
import sys, torch

# 1. Download model files
model_path = snapshot_download("axiotic/ogma-large")

# 2. Load model (bundled source code)
sys.path.insert(0, model_path)
from ogma_model import OgmaModel
model = OgmaModel.from_checkpoint(model_path, device="cpu")
model.eval()

# 3. Tokenizer
N_SPECIAL = 7
_tok = Tokenizer.from_file(f"{model_path}/tokenizer.json")

def encode(texts: list, max_length: int = 1024):
    all_ids = []
    for text in texts:
        enc = _tok.encode(text)
        ids, toks = enc.ids, enc.tokens
        # Strip CLS/SEP added by tokenizer
        if toks and toks[0] in ("[CLS]", "<s>"):
            ids, toks = ids[1:], toks[1:]
        if toks and toks[-1] in ("[SEP]", "</s>"):
            ids = ids[:-1]
        # Shift into Ogma's vocabulary space and add BOS/EOS
        ogma_ids = [2] + [rid + N_SPECIAL for rid in ids] + [3]
        all_ids.append(ogma_ids[:max_length])

    ml = max(len(ids) for ids in all_ids)
    token_ids = torch.zeros(len(texts), ml, dtype=torch.long)
    attn_mask = torch.zeros(len(texts), ml, dtype=torch.long)
    for i, ids in enumerate(all_ids):
        token_ids[i, :len(ids)] = torch.tensor(ids)
        attn_mask[i, :len(ids)] = 1
    return token_ids, attn_mask

# 4. Encode (symmetric mode — good for clustering, classification, STS)
from config import TaskToken

sentences = [
    "The quick brown fox jumps over the lazy dog",
    "A fast auburn vulpine leaps over an idle canine",
]
with torch.no_grad():
    token_ids, attn_mask = encode(sentences)
    embeddings = model.encode(token_ids, attn_mask, task=TaskToken.SYM)

print(embeddings.shape)  # (2, 256): one 256-d vector per sentence
sim = (embeddings[0] @ embeddings[1]).item()
print(f"Cosine similarity: {sim:.4f}")  # L2-normalised, dot product = cosine
```

### Asymmetric Retrieval (Query / Document)

Use `TaskToken.QRY` for query embeddings and `TaskToken.DOC` for document embeddings in retrieval pipelines. This asymmetric encoding is a first-class feature of the Ogma architecture.

```python
# Asymmetric retrieval — encode queries with QRY, passages with DOC
from config import TaskToken

queries = [
    "What is knowledge distillation?",
    "How does retrieval-augmented generation work?",
]
documents = [
    "Knowledge distillation trains a smaller student model to mimic a larger teacher...",
    "Retrieval-Augmented Generation (RAG) combines a dense retriever with a language model...",
]

with torch.no_grad():
    q_ids, q_mask = encode(queries)
    d_ids, d_mask = encode(documents)
    q_emb = model.encode(q_ids, q_mask, task=TaskToken.QRY)  # (N, 256)
    d_emb = model.encode(d_ids, d_mask, task=TaskToken.DOC)  # (M, 256)

# Dot product == cosine similarity (embeddings are L2-normalised)
scores = q_emb @ d_emb.T  # (N, M)
print(scores)
```
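
Once the score matrix is available, retrieval reduces to ranking each row. The following is a minimal sketch on nested lists, with made-up scores rather than real model output; `top_k` is a hypothetical helper name. With tensors, `scores.tolist()` would produce the same nested-list shape, and `torch.topk(scores, k, dim=1)` is a vectorised alternative.

```python
def top_k(scores, k=1):
    """For each query row, return (doc_index, score) pairs in descending score order."""
    ranked = []
    for row in scores:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranked.append([(j, row[j]) for j in order[:k]])
    return ranked


# Illustrative scores only: 2 queries x 3 documents
example_scores = [
    [0.82, 0.11, 0.40],
    [0.05, 0.77, 0.31],
]

best = top_k(example_scores, k=2)
for query_index, hits in enumerate(best):
    print(query_index, hits)
```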

### Matryoshka — Flexible Dimensionality

Ogma supports Matryoshka Representation Learning. Truncate and re-normalise to any supported sub-dimension for faster indexing or lower memory usage — no retraining required.

```python
import torch.nn.functional as F

with torch.no_grad():
    token_ids, attn_mask = encode(sentences)
    # Full 256-dimensional embeddings, shape (N, 256); task passed explicitly as in Basic Encoding
    emb_full = model.encode(token_ids, attn_mask, task=TaskToken.SYM)

# Truncate to a supported sub-dimension, then re-normalise so dot product still equals cosine
# Supported dims: [32, 64, 128, 256]
emb_32 = F.normalize(emb_full[:, :32], dim=-1)
emb_64 = F.normalize(emb_full[:, :64], dim=-1)
emb_128 = F.normalize(emb_full[:, :128], dim=-1)
```
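
To make the memory trade-off concrete, the sketch below estimates raw index size at each supported dimension and shows the truncate-and-renormalise step on a toy vector. It assumes float32 storage (4 bytes per value) and ignores any index overhead; the one-million-vector corpus and helper names are illustrative.

```python
import math


def truncate_and_renormalize(vec, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


def index_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw storage for `n_vectors` embeddings, excluding index overhead."""
    return n_vectors * dim * bytes_per_value


# Toy 4-d unit vector standing in for a 256-d embedding
full = [0.5, 0.5, 0.5, 0.5]
half = truncate_and_renormalize(full, 2)

for dim in (256, 128, 64, 32):
    print(dim, index_bytes(1_000_000, dim) / 1e6, "MB")
```

Halving the dimension halves raw storage, so a 32-d index needs one eighth of the space of the full 256-d index.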

### LangChain Integration

```python
# LangChain integration (custom embeddings class).
# Reuses the `encode` helper (and its tokenizer) defined under "Basic Encoding" above.
import sys

import torch
from huggingface_hub import snapshot_download
from langchain_core.embeddings import Embeddings  # requires the langchain-core package


class OgmaEmbeddings(Embeddings):
    def __init__(self, model_name: str = "axiotic/ogma-large", device: str = "cpu"):
        model_path = snapshot_download(model_name)
        # The model source (ogma_model.py, config.py) ships with the checkpoint,
        # so it is only importable once model_path is on sys.path
        sys.path.insert(0, model_path)
        from config import TaskToken
        from ogma_model import OgmaModel
        self._task = TaskToken
        self._device = device
        self.model = OgmaModel.from_checkpoint(model_path, device=device)
        self.model.eval()

    def _encode(self, texts, task):
        with torch.no_grad():
            ids, mask = encode(texts)
            return self.model.encode(ids.to(self._device), mask.to(self._device), task=task)

    def embed_documents(self, texts):
        # Documents are encoded with the [DOC] task token
        return self._encode(texts, self._task.DOC).cpu().tolist()

    def embed_query(self, text):
        # Queries are encoded with the [QRY] task token
        return self._encode([text], self._task.QRY)[0].cpu().tolist()

embeddings = OgmaEmbeddings()
```

---

## Model Family

| Model | Params | Size | MTEB Avg | Class | Clust | PairClass | Rerank | Ret | STS | Summ | d_out | Context |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **[ogma-large](https://huggingface.co/axiotic/ogma-large)** | 32.4M | 124 MB | **57.41** | 68.6 | 41.6 | 84.0 | 53.1 | 43.7 | 83.7 | 30.9 | 256 | 1024 |
| **[ogma-base](https://huggingface.co/axiotic/ogma-base)** | 13.3M | 51 MB | **57.04** | 67.89 | 41.49 | 83.73 | 51.25 | 42.36 | 82.84 | 29.73 | 256 | 1024 |
| **[ogma-small](https://huggingface.co/axiotic/ogma-small)** | 8.6M | 33 MB | 56.34 | 66.67 | 40.69 | 82.91 | 50.51 | 42.05 | 82.00 | 29.59 | 256 | 1024 |
| **[ogma-mini](https://huggingface.co/axiotic/ogma-mini)** | 3.5M | 14 MB | 53.07 | 61.80 | 37.38 | 79.66 | 47.39 | 36.21 | 77.71 | 31.33 | 256 | 1024 |
| **[ogma-micro](https://huggingface.co/axiotic/ogma-micro)** | 2.3M | 8.9 MB | 52.19 | 59.57 | 36.88 | 78.62 | 49.74 | 33.09 | 75.63 | 31.77 | 128 | 1024 |
| *all-MiniLM-L6-v2* | 22.7M | 87 MB | *56.09* | 62.62 | 41.94 | 82.37 | 58.04 | 41.95 | 78.90 | 30.81 | 384 | 256 |
| *potion-base-32M* | 32.3M | 123 MB | *51.66* | 65.97 | 35.29 | 78.17 | 50.92 | 33.52 | 74.22 | 29.78 | 256 | inf |
| *potion-base-8M* | 7.6M | 29 MB | *50.03* | 64.44 | 32.93 | 76.62 | 49.73 | 31.71 | 73.24 | 29.28 | 256 | inf |

All Ogma: MTEB 2.10.7, 54-task standard English set, category-averaged.
MiniLM/Potion: published scores from [Model2Vec results page](https://github.com/MinishLab/model2vec/blob/main/results/README.md).

---

## Training Details

| Property | Value |
|---|---|
| Teacher model | `jinaai/jina-embeddings-v5-text-small` (CC-BY-NC-4.0) |
| Training paradigm | Knowledge distillation from cached teacher embeddings |
| Training data | ~7M curated English sentence pairs |
| Tokenizer | AlbertTokenizer (SentencePiece, vocab=30,000) |
| Embedding initialisation | PCA of teacher embeddings (128d) projected to d_model |
| Loss | Distillation + contrastive (balanced schedule) |
| Evaluation framework | MTEB 2.10.7 |

---

## Limitations

- **No text generation.** Ogma is an encoder-only embedding model.
- **English only.** Training data and evaluation are English-only.
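- **Slower than static models.** In the CPU benchmark above, ogma-large is roughly 260× to 1,400× slower than potion-base-32M, depending on thread count and batch size (for example, 22.1 vs 6,826 docs/s at one thread and batch size 1). The trade-off: contextual understanding and 4× longer sequences.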
- **Non-commercial licence.** Due to distillation from a CC-BY-NC-4.0 teacher, Ogma inherits the NonCommercial restriction. Commercial use requires a separate Jina AI licence or retraining with a permissive teacher (Apache 2.0 compatible models like BGE or E5 can substitute at the cost of a full retraining run).
- **Reranking gap.** Ogma lags behind MiniLM-L6-v2 on reranking tasks (category avg delta: -4.9). This is an architectural characteristic: the model optimises for semantic similarity and classification over pairwise ranking.

---

## Licence & Attribution

This model is released under **[CC-BY-NC-4.0](https://creativecommons.org/licenses/by-nc/4.0/)** (Creative Commons Attribution-NonCommercial 4.0 International).

**Required attribution (must be included in all uses):**

> This model was trained via knowledge distillation from
> `jina-embeddings-v5-text-small` (https://huggingface.co/jinaai/jina-embeddings-v5-text-small)
> by Jina AI, licensed under CC-BY-NC-4.0.

---

## Citation

```bibtex
@misc{ogma2026,
  title     = {Ogma: Efficient Dense Retrieval via Structured Embeddings},
  author    = {Axiotic AI},
  year      = {2026},
  url       = {https://huggingface.co/axiotic/ogma-large},
}
```