---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- onnx
- onnxruntime
- ai-security
- duplicate-detection
- jailbreak-detection
language: multilingual
pipeline_tag: sentence-similarity
library_name: onnx
---

# jailbreak-embeddings-base-onnx

ONNX export of the `multilingual-e5-base-wjb-threatfeed_v1` model — a fine-tuned [sentence-transformers](https://www.SBERT.net) model for detecting duplicate vulnerability submissions (jailbreak and prompt injection attacks) in the 0din threat feed.

It maps prompts to a 768-dimensional dense vector space optimized for semantic similarity comparison of attack prompts.

This model achieves a **+50.6% relative F1 improvement** (0.696 vs. 0.462) over the OpenAI `text-embedding-3-large` baseline on duplicate detection.

## Model Details

### Model Description

- **Model Type:** Sentence Transformer (two-stage fine-tuned), exported to ONNX
- **Base Model:** [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) (~278M parameters)
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **Language:** Multilingual (XLM-RoBERTa backbone)
- **Format:** ONNX (compatible with onnxruntime, tract-onnx, and other ONNX runtimes)

### Embedding Pipeline

```
Input Text → Tokenizer → ONNX Model → Mean Pooling → L2 Normalization → Embedding
```

The ONNX model contains only the transformer backbone. Mean pooling and L2 normalization must be implemented in application code (see usage examples below).

### Model Inputs

The ONNX model requires 3 inputs:
- `input_ids`: Token IDs from tokenizer
- `attention_mask`: 1 for real tokens, 0 for padding
- `token_type_ids`: All zeros for single-sentence embeddings
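
The three inputs listed above can be assembled from raw token ID lists as in the following minimal sketch. The helper name `build_model_inputs` and the token IDs are hypothetical; `pad_id=1` is taken from the tokenizer setup in the usage example, and `int64` matches the dtype used there.

```python
import numpy as np


def build_model_inputs(token_id_lists, pad_id=1):
    """Pad variable-length token ID lists into the three int64 arrays.

    Hypothetical helper: pad_id=1 follows the usage example's tokenizer setup.
    """
    batch = len(token_id_lists)
    max_len = max(len(ids) for ids in token_id_lists)

    # Start from all-padding, then copy the real tokens in.
    input_ids = np.full((batch, max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((batch, max_len), dtype=np.int64)
    for row, ids in enumerate(token_id_lists):
        input_ids[row, : len(ids)] = ids
        attention_mask[row, : len(ids)] = 1  # 1 = real token, 0 = padding

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        # All zeros for single-sentence embeddings.
        "token_type_ids": np.zeros_like(input_ids),
    }


# Illustrative token IDs only; real IDs come from the tokenizer.
feed = build_model_inputs([[0, 581, 2], [0, 581, 1234, 5678, 2]])
print({name: array.shape for name, array in feed.items()})
```

The resulting dictionary has the same keys as the feed passed to `session.run` in the onnxruntime example.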

### ONNX Verification

The ONNX export produces **bit-for-bit identical** embeddings to the native sentence-transformers model (0.000000 max difference across all test sentences).
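
A check of this kind can be sketched as follows. This is not the verification script used for the model; `max_abs_difference` is a hypothetical helper and the two arrays are stand-ins for embeddings produced by the native and ONNX pipelines.

```python
import numpy as np


def max_abs_difference(reference, candidate):
    """Largest element-wise absolute difference between two embedding arrays."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    if reference.shape != candidate.shape:
        raise ValueError("embedding arrays must have the same shape")
    return float(np.max(np.abs(reference - candidate)))


# Stand-in values; in practice these would be the two models' outputs.
native_embeddings = np.array([[0.6, 0.8], [1.0, 0.0]])
onnx_embeddings = native_embeddings.copy()

diff = max_abs_difference(native_embeddings, onnx_embeddings)
print(f"max difference: {diff:.6f}")
print("exactly equal:", np.array_equal(native_embeddings, onnx_embeddings))
```

A difference printed with six decimals only shows agreement to that precision; `np.array_equal` is the stricter test for exact equality.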

## Intended Use

This model is designed for:

- **Duplicate detection** in AI security vulnerability reports (jailbreak/prompt injection attacks)
- **Semantic similarity** comparison of attack prompts that may use different surface-level techniques but target the same underlying vulnerability
- **Embedding generation** for LSH-based similarity search in vulnerability management systems
- **Edge/server deployment** via ONNX runtime without requiring PyTorch

The model is trained to recognize semantic equivalence between attack prompts even when they use different jailbreak tactics (e.g., role-playing, encoding, academic framing) to elicit the same harmful behavior.
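
For the LSH use case, one common scheme for cosine similarity is random-hyperplane hashing. The sketch below is an assumption for illustration only: the card does not say which LSH scheme the vulnerability management system uses, and the function names, the 16-bit signature length, and the seeds are hypothetical.

```python
import numpy as np


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals in a dim-dimensional space."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_bits, dim))


def lsh_signature(embedding, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    bits = (hyperplanes @ embedding) >= 0
    return "".join("1" if bit else "0" for bit in bits)


def hamming(sig_a, sig_b):
    """Number of differing bits between two signatures."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


planes = make_hyperplanes(dim=768, n_bits=16)

# Random unit vector standing in for a 768-dimensional model embedding.
vector = np.random.default_rng(1).standard_normal(768)
vector /= np.linalg.norm(vector)

signature = lsh_signature(vector, planes)
print(signature, hamming(signature, lsh_signature(-vector, planes)))
```

Vectors with high cosine similarity tend to share most signature bits, so signatures can serve as bucket keys before the exact cosine comparison is run on candidates.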

## Usage

### sentence-transformers (with ONNX backend)

```python
from sentence_transformers import SentenceTransformer

# Load directly with ONNX backend
model = SentenceTransformer("0dinai/jailbreak-embeddings-base-onnx", backend="onnx")

sentences = ["First attack prompt", "Second attack prompt"]
embeddings = model.encode(sentences)
similarity = model.similarity(embeddings, embeddings)
print(similarity)
```

### Python (onnxruntime)

```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

# Load model and tokenizer
session = ort.InferenceSession("onnx/model.onnx")
tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.enable_padding(pad_id=1, pad_token="<pad>")
tokenizer.enable_truncation(max_length=512)

# Tokenize
texts = ["First attack prompt", "Second attack prompt"]
encodings = tokenizer.encode_batch(texts)
input_ids = np.array([e.ids for e in encodings], dtype=np.int64)
attention_mask = np.array([e.attention_mask for e in encodings], dtype=np.int64)
token_type_ids = np.zeros_like(input_ids)

# Run ONNX inference
outputs = session.run(None, {
    "input_ids": input_ids,
    "attention_mask": attention_mask,
    "token_type_ids": token_type_ids,
})
token_embeddings = outputs[0]  # [batch, seq_len, 768]

# Mean pooling
mask = attention_mask[:, :, np.newaxis].astype(np.float32)
embeddings = (token_embeddings * mask).sum(axis=1) / mask.sum(axis=1)

# L2 normalization
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings = embeddings / norms

# Cosine similarity
similarity = np.dot(embeddings[0], embeddings[1])
print(f"Similarity: {similarity:.4f}")
```
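
To turn similarities into duplicate decisions, the L2-normalized embeddings can be compared pairwise against a threshold. The sketch below uses 0.70, the best-F1 threshold reported in the evaluation section; `find_duplicate_pairs` is a hypothetical helper and the toy vectors stand in for real embeddings.

```python
import numpy as np


def find_duplicate_pairs(embeddings, threshold=0.70):
    """Return (i, j, similarity) for every pair at or above the threshold.

    Assumes the rows are already L2-normalized, so the dot product
    equals cosine similarity.
    """
    embeddings = np.asarray(embeddings, dtype=np.float32)
    similarities = embeddings @ embeddings.T
    # Upper triangle only: each unordered pair once, no self-pairs.
    rows, cols = np.triu_indices(len(embeddings), k=1)
    return [
        (int(i), int(j), float(similarities[i, j]))
        for i, j in zip(rows, cols)
        if similarities[i, j] >= threshold
    ]


# Toy unit vectors standing in for model embeddings.
toy_embeddings = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
pairs = find_duplicate_pairs(toy_embeddings)
print(pairs)
```

The full similarity matrix grows quadratically with the number of prompts, so this form suits small batches; larger corpora would need a candidate-generation step first.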

### Rust (tract-onnx)

```rust
use tract_onnx::prelude::*;
use tokenizers::Tokenizer;

// Load model and tokenizer
let model = tract_onnx::onnx()
    .model_for_path("onnx/model.onnx")?
    .into_optimized()?
    .into_runnable()?;
let tokenizer = Tokenizer::from_file("tokenizer.json")?;

// Tokenize
let encoding = tokenizer.encode("Attack prompt text", true)?;
let input_ids: Vec<i64> = encoding.get_ids().iter().map(|&x| x as i64).collect();
let attention_mask: Vec<i64> = encoding.get_attention_mask().iter().map(|&x| x as i64).collect();
let token_type_ids: Vec<i64> = vec![0i64; input_ids.len()];

// Run inference, then apply mean pooling + L2 normalization
// (see full Rust implementation at github.com/0din-ai)
```

## Training Details

This model was trained using a **two-stage fine-tuning approach**:

### Stage 1: WildJailbreak Pre-training

Pre-trained on public synthetic data to learn jailbreak semantics.

- **Dataset:** [Allen AI WildJailbreak](https://huggingface.co/datasets/allenai/wildjailbreak) — vanilla-adversarial prompt pairs
- **Pairs:** 161,396 positive pairs (same intent, different formulation)
- **Split:** 153,326 train / 4,034 val / 4,036 test (95% / 2.5% / 2.5%)
- **Loss:** MultipleNegativesRankingLoss (in-batch negatives)
- **Batch size:** 16 (per device) x 2 gradient accumulation steps = 32 effective
- **Learning rate:** 1e-5
- **FP16:** True
- **Purpose:** Teach the model to see through jailbreak wrappers and match prompts by underlying intent
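
The in-batch-negatives objective can be illustrated with a small NumPy sketch. This is not the training code: it assumes cosine similarity and a scale of 20.0 (the sentence-transformers defaults as far as we know, not confirmed for this run), and the function name is hypothetical.

```python
import numpy as np


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled cosine similarities.

    Row i of `positives` is the match for row i of `anchors`; every other
    row in the batch acts as a negative for that anchor.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for anchor i is positive i (the diagonal).
    return float(-np.mean(np.diag(log_probs)))


anchors = np.eye(3)
aligned_loss = multiple_negatives_ranking_loss(anchors, np.eye(3))
shuffled_loss = multiple_negatives_ranking_loss(anchors, np.roll(np.eye(3), 1, axis=0))
print(aligned_loss, shuffled_loss)
```

The loss is near zero when each anchor is closest to its own positive and large when the pairs are mismatched. In the usual implementation the negatives come from a single forward pass, so gradient accumulation would not be expected to enlarge the pool of negatives beyond the per-device batch.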

### Stage 2: Threat Feed Fine-tuning

Fine-tuned on annotated pairs from the internal 0din threat feed.

- **Pairs:** 9,598 annotated pairs (7,678 train / 958 val / 962 test)
- **Label Distribution:** ~34% duplicates / ~66% non-duplicates
- **Annotation:** Google Gemini 2.5 Pro (single-model annotation)
- **Source Similarity Threshold:** Candidate pairs generated with Thor similarity >= 0.5
- **Loss:** ContrastiveLoss (cosine distance, margin=0.5)
- **Purpose:** Calibrate the model for real-world duplicate detection on production vulnerability data
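
The contrastive objective with cosine distance and a 0.5 margin can be sketched as below. This follows the formulation of Hadsell et al. as commonly implemented (a 0.5 factor and a mean over pairs, both assumptions here); it is an illustration rather than the training code, and `contrastive_loss` is a hypothetical name.

```python
import numpy as np


def contrastive_loss(emb_a, emb_b, labels, margin=0.5):
    """Contrastive loss on cosine distance (1 - cosine similarity).

    labels: 1 for duplicate pairs, 0 for non-duplicates.
    """
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    distance = 1.0 - np.sum(a * b, axis=1)
    labels = np.asarray(labels, dtype=np.float64)

    # Duplicates are pulled together; non-duplicates are pushed apart
    # until their distance reaches the margin, then ignored.
    pull = labels * distance**2
    push = (1.0 - labels) * np.maximum(margin - distance, 0.0) ** 2
    return float(np.mean(0.5 * (pull + push)))


same = np.array([[1.0, 0.0]])
orthogonal = np.array([[0.0, 1.0]])
print(contrastive_loss(same, same, [1]))        # identical duplicates
print(contrastive_loss(same, same, [0]))        # identical non-duplicates
print(contrastive_loss(same, orthogonal, [1]))  # distant duplicates
print(contrastive_loss(same, orthogonal, [0]))  # distant non-duplicates
```

Under these assumptions, a non-duplicate pair stops contributing once its cosine similarity falls to 0.5 or below, which is why the margin interacts with the similarity threshold chosen at inference time.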

#### Stage 2 Hyperparameters

| Parameter | Value |
|-----------|-------|
| Epochs | 50 (early stopped) |
| Batch size | 8 (per device) x 4 gradient accumulation = 32 effective |
| Learning rate | 1e-5 |
| LR scheduler | Linear |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| FP16 | True |
| Early stopping patience | 10 |
| Eval steps | 50 |
| Seed | 1 |
| Best checkpoint | Step 1200 (epoch 5.0) |
| Best validation loss | 0.0149 |
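
The step and epoch figures in the table can be cross-checked with simple arithmetic. The sketch assumes a single device and one optimizer step per effective batch; the variable names are illustrative.

```python
import math

train_pairs = 7_678          # Stage 2 training pairs
per_device_batch = 8
grad_accumulation = 4
best_step = 1_200            # best checkpoint from the table

effective_batch = per_device_batch * grad_accumulation
# Assumption: one optimizer step per effective batch on a single device.
steps_per_epoch = math.ceil(train_pairs / effective_batch)
best_epoch = best_step / steps_per_epoch

print(effective_batch, steps_per_epoch, best_epoch)
```

With 240 optimizer steps per epoch, step 1200 corresponds to epoch 5.0, which matches the best checkpoint listed above.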

## Evaluation Results

### Duplicate Detection Performance

Evaluated on 55 human-labeled vulnerability pairs (10 duplicates, 45 non-duplicates) from a corpus of 3,749 vulnerabilities. Best F1 score at each model's optimal threshold:

| Model | Best F1 | Threshold | Precision | Recall |
|-------|---------|-----------|-----------|--------|
| OpenAI text-embedding-3-large (baseline) | 0.462 | 0.80 | 1.000 | 0.300 |
| Finetuned V1 (WildJailbreak only, e5-small) | 0.500 | 0.50 | 0.333 | 1.000 |
| Finetuned V2 (WJB + threat feed v1, e5-small) | 0.526 | 0.70 | 0.556 | 0.500 |
| Finetuned V3 (WJB + threat feed v2, e5-small) | 0.556 | 0.75 | 0.625 | 0.500 |
| Finetuned V4 (WJB + threat feed 10k, e5-small) | 0.600 | 0.70 | 0.600 | 0.600 |
| **This model (Base V1)** | **0.696** | **0.70** | **0.615** | **0.800** |

### Threshold Analysis (This Model)

| Threshold | Precision | Recall | F1 | TP | FP | FN | TN |
|-----------|-----------|--------|------|----|----|----|----|
| 0.50 | 0.243 | 0.900 | 0.383 | 9 | 28 | 1 | 17 |
| 0.55 | 0.308 | 0.800 | 0.444 | 8 | 18 | 2 | 27 |
| 0.60 | 0.381 | 0.800 | 0.516 | 8 | 13 | 2 | 32 |
| 0.65 | 0.500 | 0.800 | 0.615 | 8 | 8 | 2 | 37 |
| **0.70** | **0.615** | **0.800** | **0.696** | **8** | **5** | **2** | **40** |
| 0.75 | 0.625 | 0.500 | 0.556 | 5 | 3 | 5 | 42 |
| 0.80 | 0.800 | 0.400 | 0.533 | 4 | 1 | 6 | 44 |
| 0.85 | 1.000 | 0.300 | 0.462 | 3 | 0 | 7 | 45 |
| 0.90 | 1.000 | 0.100 | 0.182 | 1 | 0 | 9 | 45 |
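
The precision, recall, and F1 columns follow directly from the TP/FP/FN counts. The sketch below recomputes them from the table and selects the best-F1 threshold; the function and variable names are illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


# (TP, FP, FN) per threshold, copied from the table above.
counts = {
    0.50: (9, 28, 1),
    0.55: (8, 18, 2),
    0.60: (8, 13, 2),
    0.65: (8, 8, 2),
    0.70: (8, 5, 2),
    0.75: (5, 3, 5),
    0.80: (4, 1, 6),
    0.85: (3, 0, 7),
    0.90: (1, 0, 9),
}

best_threshold = max(counts, key=lambda t: precision_recall_f1(*counts[t])[2])
precision, recall, f1 = precision_recall_f1(*counts[best_threshold])
print(f"{best_threshold:.2f} {precision:.3f} {recall:.3f} {f1:.3f}")
```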

### Key Findings

- **+50.6% relative F1 improvement** over the OpenAI text-embedding-3-large baseline (0.696 vs 0.462, an absolute gain of 0.234)
- **Largest single jump in the series:** +16% relative F1 over the e5-small V4 model (0.696 vs 0.600), suggesting that model capacity matters for this task.
- **Substantially higher recall:** At threshold 0.70, this model achieves 0.800 recall vs 0.600 for e5-small V4, while maintaining comparable precision (0.615 vs 0.600).
- **Wide effective threshold band:** Recall stays at 0.800 across thresholds 0.55–0.70 (and rises to 0.900 at 0.50), suggesting the larger model produces more confident and well-separated similarity scores for true duplicate pairs.
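
The percentage gains quoted in this list are relative to the comparison model's F1, not absolute differences. A short sketch of that arithmetic, with an illustrative function name:

```python
def relative_gain(new_score, old_score):
    """Relative improvement of new_score over old_score."""
    return (new_score - old_score) / old_score


gain_vs_openai = relative_gain(0.696, 0.462)
gain_vs_small_v4 = relative_gain(0.696, 0.600)

print(f"{gain_vs_openai:+.1%}")    # +50.6%
print(f"{gain_vs_small_v4:+.1%}")  # +16.0%
```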

> **Note:** The evaluation dataset is small (55 pairs, 10 positive). With only 10 true duplicates, each TP/FP change causes large metric swings. Results should be interpreted with caution.
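
To give a sense of the size of these swings, the sketch below starts from the counts at threshold 0.70 and moves a single duplicate pair between "found" and "missed". The scenario is illustrative, not an additional result.

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    return 2 * tp / (2 * tp + fp + fn)


tp, fp, fn = 8, 5, 2  # counts at threshold 0.70

reported = f1_score(tp, fp, fn)
one_fewer_found = f1_score(tp - 1, fp, fn + 1)  # one duplicate missed
one_more_found = f1_score(tp + 1, fp, fn - 1)   # one more duplicate found

print(f"{one_fewer_found:.3f} {reported:.3f} {one_more_found:.3f}")
```

A single pair moving either way shifts F1 from 0.696 to 0.636 or 0.750, a range comparable to the gaps between several models in the comparison table.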

## Limitations

- **Small evaluation set:** Only 55 human-labeled pairs (10 duplicates). Results should be taken as directional rather than definitive.
- **LLM annotation bias in training data:** Stage 2 training data was annotated by a single LLM (Gemini 2.5 Pro), which may affect calibration.
- **Model size:** ~278M parameters with 768-dim embeddings. The ONNX model is ~1GB.
- **Domain-specific:** Optimized for jailbreak/prompt injection duplicate detection. Performance on general semantic similarity tasks is not evaluated.
- **Single-turn only:** This model was only trained on single-prompt jailbreaks and should not be used to process multi-turn conversations. In the future, we plan to release models that can handle multi-turn jailbreak scenarios.

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### ContrastiveLoss
```bibtex
@inproceedings{hadsell2006dimensionality,
    author={Hadsell, R. and Chopra, S. and LeCun, Y.},
    booktitle={2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06)},
    title={Dimensionality Reduction by Learning an Invariant Mapping},
    year={2006},
    volume={2},
    number={},
    pages={1735-1742},
    doi={10.1109/CVPR.2006.100}
}
```

#### WildJailbreak
```bibtex
@article{jiang2024wildteaming,
    title={WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models},
    author={Jiang, Liwei and Rao, Kavel and Han, Seungju and Ettinger, Allyson and Brahman, Faeze and Kumar, Sachin and Mireshghallah, Niloofar and Lu, Ximing and Sap, Maarten and Choi, Yejin and Dziri, Nouha},
    journal={arXiv preprint arXiv:2406.18510},
    year={2024}
}
```