---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- onnx
- onnxruntime
- ai-security
- duplicate-detection
- jailbreak-detection
language: multilingual
pipeline_tag: sentence-similarity
library_name: onnx
---

# jailbreak-embeddings-base-onnx

ONNX export of the `multilingual-e5-base-wjb-threatfeed_v1` model, a fine-tuned [sentence-transformers](https://www.SBERT.net) model for detecting duplicate vulnerability submissions (jailbreak and prompt injection attacks) in the 0din threat feed. It maps prompts to a 768-dimensional dense vector space optimized for semantic similarity comparison of attack prompts.

This model achieves a **+50.6% relative F1 improvement** over the OpenAI `text-embedding-3-large` baseline on duplicate detection (0.696 vs 0.462).

## Model Details

### Model Description

- **Model Type:** Sentence Transformer (two-stage fine-tuned), exported to ONNX
- **Base Model:** [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) (~278M parameters)
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **Language:** Multilingual (XLM-RoBERTa backbone)
- **Format:** ONNX (compatible with onnxruntime, tract-onnx, and other ONNX runtimes)

### Embedding Pipeline

```
Input Text → Tokenizer → ONNX Model → Mean Pooling → L2 Normalization → Embedding
```

The ONNX model contains only the transformer backbone. Mean pooling and L2 normalization must be implemented in application code (see usage examples below).

### Model Inputs

The ONNX model requires 3 inputs:

- `input_ids`: Token IDs from the tokenizer
- `attention_mask`: 1 for real tokens, 0 for padding
- `token_type_ids`: All zeros for single-sentence embeddings

### ONNX Verification

The ONNX export produces **bit-for-bit identical** embeddings to the native sentence-transformers model (0.000000 max difference across all test sentences).

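A parity check of the kind described above can be sketched in a few lines. This is a minimal illustration, not the verification script used for this export: `max_abs_diff` is a hypothetical helper, and the two arrays are stand-in data in place of embeddings produced by the ONNX pipeline and by the native model for the same sentences.

```python
import numpy as np


def max_abs_diff(onnx_embeddings: np.ndarray, native_embeddings: np.ndarray) -> float:
    """Largest element-wise absolute difference between two embedding matrices."""
    if onnx_embeddings.shape != native_embeddings.shape:
        raise ValueError("embedding matrices must have the same shape")
    return float(np.max(np.abs(onnx_embeddings - native_embeddings)))


# Stand-in data (assumption): in practice both arrays would come from encoding
# the same test sentences with the ONNX pipeline and with the native model.
rng = np.random.default_rng(0)
native = rng.normal(size=(4, 768)).astype(np.float32)
native /= np.linalg.norm(native, axis=1, keepdims=True)  # L2-normalize rows
onnx_out = native.copy()

diff = max_abs_diff(onnx_out, native)
print(f"max difference: {diff:.6f}")
```

Reporting the maximum rather than the mean difference makes a single divergent dimension visible, which is why it is a common choice for export checks.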
## Intended Use

This model is designed for:

- **Duplicate detection** in AI security vulnerability reports (jailbreak/prompt injection attacks)
- **Semantic similarity** comparison of attack prompts that may use different surface-level techniques but target the same underlying vulnerability
- **Embedding generation** for LSH-based similarity search in vulnerability management systems
- **Edge/server deployment** via ONNX runtime without requiring PyTorch

The model is trained to recognize semantic equivalence between attack prompts even when they use different jailbreak tactics (e.g., role-playing, encoding, academic framing) to elicit the same harmful behavior.

## Usage

### sentence-transformers (with ONNX backend)

```python
from sentence_transformers import SentenceTransformer

# Load directly with ONNX backend
model = SentenceTransformer("0dinai/jailbreak-embeddings-base-onnx", backend="onnx")

sentences = ["First attack prompt", "Second attack prompt"]
embeddings = model.encode(sentences)

similarity = model.similarity(embeddings, embeddings)
print(similarity)
```

### Python (onnxruntime)

```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

# Load model and tokenizer
session = ort.InferenceSession("onnx/model.onnx")
tokenizer = Tokenizer.from_file("tokenizer.json")
# XLM-RoBERTa tokenizers use "<pad>" (id 1) as the padding token
tokenizer.enable_padding(pad_id=1, pad_token="<pad>")
tokenizer.enable_truncation(max_length=512)

# Tokenize
texts = ["First attack prompt", "Second attack prompt"]
encodings = tokenizer.encode_batch(texts)
input_ids = np.array([e.ids for e in encodings], dtype=np.int64)
attention_mask = np.array([e.attention_mask for e in encodings], dtype=np.int64)
token_type_ids = np.zeros_like(input_ids)

# Run ONNX inference
outputs = session.run(None, {
    "input_ids": input_ids,
    "attention_mask": attention_mask,
    "token_type_ids": token_type_ids,
})
token_embeddings = outputs[0]  # [batch, seq_len, 768]

# Mean pooling over real (non-padding) tokens
mask = attention_mask[:, :, np.newaxis].astype(np.float32)
embeddings = (token_embeddings * mask).sum(axis=1) / mask.sum(axis=1)

# L2 normalization
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings = embeddings / norms

# Cosine similarity (dot product of unit vectors)
similarity = np.dot(embeddings[0], embeddings[1])
print(f"Similarity: {similarity:.4f}")
```

### Rust (tract-onnx)

```rust
use tract_onnx::prelude::*;
use tokenizers::Tokenizer;

// Load model and tokenizer
let model = tract_onnx::onnx()
    .model_for_path("onnx/model.onnx")?
    .into_optimized()?
    .into_runnable()?;
let tokenizer = Tokenizer::from_file("tokenizer.json")?;

// Tokenize
let encoding = tokenizer.encode("Attack prompt text", true)?;
let input_ids: Vec<i64> = encoding.get_ids().iter().map(|&x| x as i64).collect();
let attention_mask: Vec<i64> = encoding.get_attention_mask().iter().map(|&x| x as i64).collect();
let token_type_ids: Vec<i64> = vec![0i64; input_ids.len()];

// Run inference, then apply mean pooling + L2 normalization
// (see full Rust implementation at github.com/0din-ai)
```

## Training Details

This model was trained using a **two-stage fine-tuning approach**:

### Stage 1: WildJailbreak Pre-training

Pre-trained on public synthetic data to learn jailbreak semantics.

- **Dataset:** [Allen AI WildJailbreak](https://huggingface.co/datasets/allenai/wildjailbreak), vanilla-adversarial prompt pairs
- **Pairs:** 161,396 positive pairs (same intent, different formulation)
- **Split:** 153,326 train / 4,034 val / 4,036 test (95% / 2.5% / 2.5%)
- **Loss:** MultipleNegativesRankingLoss (in-batch negatives)
- **Batch size:** 16 (per device) x 2 gradient accumulation steps = 32 effective
- **Learning rate:** 1e-5
- **FP16:** True
- **Purpose:** Teach the model to see through jailbreak wrappers and match prompts by underlying intent

### Stage 2: Threat Feed Fine-tuning

Fine-tuned on annotated pairs from the internal 0din threat feed.

- **Pairs:** 9,598 annotated pairs (7,678 train / 958 val / 962 test)
- **Label Distribution:** ~34% duplicates / ~66% non-duplicates
- **Annotation:** Google Gemini 2.5 Pro (single-model annotation)
- **Source Similarity Threshold:** Candidate pairs generated with Thor similarity >= 0.5
- **Loss:** ContrastiveLoss (cosine distance, margin=0.5)
- **Purpose:** Calibrate the model for real-world duplicate detection on production vulnerability data

#### Stage 2 Hyperparameters

| Parameter | Value |
|-----------|-------|
| Epochs | 50 (early stopped) |
| Batch size | 8 (per device) x 4 gradient accumulation = 32 effective |
| Learning rate | 1e-5 |
| LR scheduler | Linear |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| FP16 | True |
| Early stopping patience | 10 |
| Eval steps | 50 |
| Seed | 1 |
| Best checkpoint | Step 1200 (epoch 5.0) |
| Best validation loss | 0.0149 |

## Evaluation Results

### Duplicate Detection Performance

Evaluated on 55 human-labeled vulnerability pairs (10 duplicates, 45 non-duplicates) from a corpus of 3,749 vulnerabilities.

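Precision, recall, and F1 at a given threshold follow directly from the confusion counts over the labeled pairs. The sketch below is illustrative and is not the evaluation script behind this model card: `confusion_counts` and `prf1` are hypothetical helpers, the toy scores are made up, and treating a pair as a duplicate when its similarity is greater than or equal to the threshold is an assumption. The final call uses the counts reported for this model at threshold 0.70 in the Threshold Analysis table.

```python
from typing import Sequence, Tuple


def confusion_counts(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN), predicting duplicate when score >= threshold."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold  # assumption: inclusive threshold
        if predicted and label == 1:
            tp += 1
        elif predicted and label == 0:
            fp += 1
        elif not predicted and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def prf1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example with made-up similarity scores (1 = duplicate, 0 = non-duplicate).
toy_counts = confusion_counts([0.91, 0.66, 0.74, 0.40], [1, 1, 0, 0], threshold=0.70)
print(toy_counts)  # (1, 1, 1, 1)

# Counts reported for this model at threshold 0.70: TP=8, FP=5, FN=2.
precision, recall, f1 = prf1(tp=8, fp=5, fn=2)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.615 recall=0.800 f1=0.696
```

With only 10 duplicates in the evaluation set, moving a single true pair across the threshold shifts recall by 0.1.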
Best F1 score at each model's optimal threshold:

| Model | Best F1 | Threshold | Precision | Recall |
|-------|---------|-----------|-----------|--------|
| OpenAI text-embedding-3-large (baseline) | 0.462 | 0.80 | 1.000 | 0.300 |
| Finetuned V1 (WildJailbreak only, e5-small) | 0.500 | 0.50 | 0.333 | 1.000 |
| Finetuned V2 (WJB + threat feed v1, e5-small) | 0.526 | 0.70 | 0.556 | 0.500 |
| Finetuned V3 (WJB + threat feed v2, e5-small) | 0.556 | 0.75 | 0.625 | 0.500 |
| Finetuned V4 (WJB + threat feed 10k, e5-small) | 0.600 | 0.70 | 0.600 | 0.600 |
| **This model (Base V1)** | **0.696** | **0.70** | **0.615** | **0.800** |

### Threshold Analysis (This Model)

| Threshold | Precision | Recall | F1 | TP | FP | FN | TN |
|-----------|-----------|--------|------|----|----|----|----|
| 0.50 | 0.243 | 0.900 | 0.383 | 9 | 28 | 1 | 17 |
| 0.55 | 0.308 | 0.800 | 0.444 | 8 | 18 | 2 | 27 |
| 0.60 | 0.381 | 0.800 | 0.516 | 8 | 13 | 2 | 32 |
| 0.65 | 0.500 | 0.800 | 0.615 | 8 | 8 | 2 | 37 |
| **0.70** | **0.615** | **0.800** | **0.696** | **8** | **5** | **2** | **40** |
| 0.75 | 0.625 | 0.500 | 0.556 | 5 | 3 | 5 | 42 |
| 0.80 | 0.800 | 0.400 | 0.533 | 4 | 1 | 6 | 44 |
| 0.85 | 1.000 | 0.300 | 0.462 | 3 | 0 | 7 | 45 |
| 0.90 | 1.000 | 0.100 | 0.182 | 1 | 0 | 9 | 45 |

### Key Findings

- **+50.6% relative F1 improvement** over the OpenAI text-embedding-3-large baseline (0.696 vs 0.462).
- **Largest single jump in the series:** +16% relative F1 over the e5-small V4 model (0.696 vs 0.600), suggesting that model capacity matters for this task.
- **Substantially higher recall:** At threshold 0.70, this model achieves 0.800 recall vs 0.600 for e5-small V4, while maintaining comparable precision (0.615 vs 0.600).
- **Wide effective threshold band:** Recall stays at 0.800 across thresholds 0.55–0.70 (and is 0.900 at 0.50), suggesting the larger model produces more confident and well-separated similarity scores for true duplicate pairs.

> **Note:** The evaluation dataset is small (55 pairs, 10 positive).
With only 10 true duplicates, each TP/FP change causes large metric swings. Results should be interpreted with caution.

## Limitations

- **Small evaluation set:** Only 55 human-labeled pairs (10 duplicates). Results should be taken as directional rather than definitive.
- **LLM annotation bias in training data:** Stage 2 training data was annotated by a single LLM (Gemini 2.5 Pro), which may affect calibration.
- **Model size:** ~278M parameters with 768-dim embeddings. The ONNX model is ~1 GB.
- **Domain-specific:** Optimized for jailbreak/prompt injection duplicate detection. Performance on general semantic similarity tasks has not been evaluated.
- **Single-turn only:** This model was trained only on single-prompt jailbreaks and should not be used to process multi-turn conversations. In the future, we plan to release models that can handle multi-turn jailbreak scenarios.

## Citation

### BibTeX

#### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### ContrastiveLoss

```bibtex
@inproceedings{hadsell2006dimensionality,
    author={Hadsell, R. and Chopra, S. and LeCun, Y.},
    booktitle={2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06)},
    title={Dimensionality Reduction by Learning an Invariant Mapping},
    year={2006},
    volume={2},
    number={},
    pages={1735-1742},
    doi={10.1109/CVPR.2006.100}
}
```

#### WildJailbreak

```bibtex
@article{jiang2024wildteaming,
    title={WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models},
    author={Jiang, Liwei and Rao, Kavel and Han, Seungju and Ettinger, Allyson and Brahman, Faeze and Kumar, Sachin and Mireshghallah, Niloofar and Lu, Ximing and Sap, Maarten and Choi, Yejin and Dziri, Nouha},
    journal={arXiv preprint arXiv:2406.18510},
    year={2024}
}
```