BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
Paper: arXiv:2402.03216
This is an ONNX export of BAAI/bge-m3, optimized for AWS Graviton4 processors.
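BGE-M3 is a versatile embedding model from the FlagEmbedding project. It supports dense, sparse (lexical), and multi-vector (ColBERT) retrieval, and this export exposes all three outputs.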
An INT8 quantized variant is available in the `quantized/` subdirectory.

```python
from optimum.onnxruntime import ORTModelForCustomTasks
from transformers import AutoTokenizer

# Load from Hugging Face Hub
model = ORTModelForCustomTasks.from_pretrained(
    "idomeneo/bge-m3-onnx-graviton4",
    file_name="model_optimized.onnx",
)
tokenizer = AutoTokenizer.from_pretrained("idomeneo/bge-m3-onnx-graviton4")

# Tokenize and get embeddings
inputs = tokenizer("Your text here", return_tensors="np", padding=True, truncation=True)
outputs = model.forward(**inputs)

# Access different embedding types
dense_embeddings = outputs["dense_vecs"]      # Shape: (batch_size, 1024)
sparse_embeddings = outputs["sparse_vecs"]    # Shape: (batch_size, seq_len, 1)
colbert_embeddings = outputs["colbert_vecs"]  # Shape: (batch_size, seq_len, 1024)
```
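The outputs above are raw arrays, so turning them into relevance scores is left to the caller. The following is a minimal sketch of two common scoring schemes, not the author's implementation: cosine similarity over the dense vectors and a late-interaction (MaxSim) score over the ColBERT vectors. The function names `dense_scores` and `colbert_score` are hypothetical, the inputs are random stand-ins with the documented shapes, and the vectors are normalized explicitly because the card does not say whether the export already does so.

```python
import numpy as np


def dense_scores(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> np.ndarray:
    """Cosine similarity between every query and every document.

    Inputs have shape (n, dim); the result has shape (n_queries, n_docs).
    """
    # Normalize explicitly in case the exported model does not already do so.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return q @ d.T


def colbert_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Late-interaction (MaxSim) score for one query and one document.

    Inputs have shape (seq_len, dim) and should contain only real tokens,
    i.e. padding rows already removed using the tokenizer's attention mask.
    """
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    token_sims = q @ d.T  # shape (query_len, doc_len)
    # Keep the best-matching document token per query token, then average.
    return float(token_sims.max(axis=1).mean())


# Stand-in data with the documented shapes; replace with real model outputs.
rng = np.random.default_rng(0)
dense = rng.normal(size=(3, 1024)).astype(np.float32)
colbert = rng.normal(size=(3, 8, 1024)).astype(np.float32)

scores = dense_scores(dense[:1], dense)             # one query against three documents
self_score = colbert_score(colbert[0], colbert[0])  # a sequence scored against itself
```

With real outputs, `dense_embeddings` can be passed to `dense_scores` directly, while each row of `colbert_embeddings` should first be trimmed to its unpadded length.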
```python
import onnxruntime as ort
from optimum.onnxruntime import ORTModelForCustomTasks
from transformers import AutoTokenizer

# Enable bfloat16 acceleration for Graviton4
sess_options = ort.SessionOptions()
sess_options.add_session_config_entry("mlas.enable_gemm_fastmath_arm64_bfloat16", "1")

model = ORTModelForCustomTasks.from_pretrained(
    "idomeneo/bge-m3-onnx-graviton4",
    file_name="model_optimized.onnx",
    session_options=sess_options,
)
tokenizer = AutoTokenizer.from_pretrained("idomeneo/bge-m3-onnx-graviton4")
```
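The bfloat16 fast-math entry targets Arm64 kernels, so a script that also runs on other machine types may want to set it conditionally. A minimal sketch, assuming `platform.machine()` reports `aarch64` (Linux) or `arm64` (macOS) on Arm hosts; the helper names `graviton_session_config` and `apply_session_config` are hypothetical and not part of ONNX Runtime or Optimum:

```python
import platform

# Session config key taken from the snippet above.
BF16_FASTMATH_KEY = "mlas.enable_gemm_fastmath_arm64_bfloat16"


def graviton_session_config(machine=None):
    """Return the session config entries to apply on the given machine type.

    `machine` defaults to the current host as reported by platform.machine().
    """
    machine = (machine or platform.machine()).lower()
    if machine in ("aarch64", "arm64"):
        return {BF16_FASTMATH_KEY: "1"}
    # On non-Arm hosts, leave the session options at their defaults.
    return {}


def apply_session_config(sess_options, entries):
    """Apply entries to any object exposing add_session_config_entry()."""
    for key, value in entries.items():
        sess_options.add_session_config_entry(key, value)
    return sess_options


# Usage with ONNX Runtime (not executed here):
#   import onnxruntime as ort
#   sess_options = apply_session_config(ort.SessionOptions(), graviton_session_config())
```

This check only looks at the CPU architecture. It does not verify that the processor actually implements bfloat16 instructions, so it is a convenience guard rather than a capability test.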
The INT8 quantized model provides significant memory savings and faster inference with minimal quality loss:
```python
from optimum.onnxruntime import ORTModelForCustomTasks
from transformers import AutoTokenizer

# Load quantized model
model = ORTModelForCustomTasks.from_pretrained(
    "idomeneo/bge-m3-onnx-graviton4",
    subfolder="quantized",
    file_name="model_optimized_quantized.onnx",
)
tokenizer = AutoTokenizer.from_pretrained("idomeneo/bge-m3-onnx-graviton4")

# Usage is identical to the standard model
inputs = tokenizer("Your text here", return_tensors="np", padding=True, truncation=True)
outputs = model.forward(**inputs)
```
Across 20 test examples in six categories, the average similarity is 99.97% or higher in every category:
| Test Category | Examples | Avg Similarity |
|---|---|---|
| English Technical | Machine learning, neural networks, NLP | 99.98% |
| English General | Common phrases, news topics | 99.97% |
| Multilingual | Chinese, Spanish, French, German, Japanese | 99.97% |
| Domain Specific | SQL queries, Python code, Biology | 99.98% |
| Edge Cases | Single char, emojis, repetitions | 99.97% |
| Semantic Variations | Paraphrases | 99.99% |
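The card does not state how these similarity figures were computed. One plausible approach, shown here purely as an assumption, is the mean row-wise cosine similarity between dense embeddings produced by a reference model and by the model under test for the same inputs. The function name `mean_similarity_percent` is hypothetical, and the arrays below are synthetic stand-ins rather than real model outputs.

```python
import numpy as np


def mean_similarity_percent(reference, candidate) -> float:
    """Mean cosine similarity, as a percentage, between matching rows.

    Row i of `reference` and row i of `candidate` are assumed to be
    embeddings of the same input text; both have shape (n_examples, dim).
    """
    ref = np.asarray(reference, dtype=np.float64)
    cand = np.asarray(candidate, dtype=np.float64)
    if ref.shape != cand.shape:
        raise ValueError("reference and candidate must have the same shape")
    ref_unit = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    cand_unit = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    # Row-wise dot product of unit vectors is the cosine similarity.
    cosines = np.sum(ref_unit * cand_unit, axis=1)
    return float(cosines.mean() * 100.0)


# Synthetic example: a "candidate" that differs from the reference by small noise.
rng = np.random.default_rng(0)
reference = rng.normal(size=(20, 1024))
candidate = reference + rng.normal(scale=0.01, size=reference.shape)

identical = mean_similarity_percent(reference, reference)
perturbed = mean_similarity_percent(reference, candidate)
```

To reproduce a comparison of this kind, `reference` and `candidate` would be replaced with the `dense_vecs` outputs of the two models being compared on an identical batch of texts.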
Overall Statistics:
On AWS Graviton4 instances, this optimized model provides:
Repository files:

- `model_optimized.onnx`: O3-optimized ONNX model with GELU approximation
- `model_optimized.onnx.data`: External weights file
- `config.json`: Model configuration
- `tokenizer.json`: Fast tokenizer
- `tokenizer_config.json`: Tokenizer configuration
- `sentencepiece.bpe.model`: SentencePiece model
- `special_tokens_map.json`: Special tokens mapping
- `ort_config.json`: ONNX Runtime configuration

Quantized model (`quantized/` subdirectory):

- `model_optimized_quantized.onnx`: INT8 quantized model
- `model_optimized_quantized.onnx.data`: Quantized weights

```bibtex
@article{bge-m3,
  title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
  author={Chen, Jianlv and Xiao, Shitao and Zhang, Peitian and Luo, Kun and Lian, Defu and Liu, Zheng},
  journal={arXiv preprint arXiv:2402.03216},
  year={2024}
}
```
MIT License (inherited from BAAI/bge-m3)
Base model: BAAI/bge-m3