---
tags:
- mteb
- sentence-transformers
- transformers
- embedding
- bidirectional
- multilingual
pipeline_tag: sentence-similarity
license: apache-2.0
base_model: BidirLM/BidirLM-Omni-2.5B-Embedding
language:
- multilingual
- af
- am
- ar
- az
- be
- bg
- bn
- bs
- ca
- ceb
- cs
- cy
- da
- de
- el
- en
- es
- et
- eu
- fa
- fi
- fr
- ga
- gl
- gu
- ha
- he
- hi
- hr
- ht
- hu
- hy
- id
- ig
- is
- it
- ja
- jv
- ka
- kk
- kn
- ko
- ky
- lt
- lv
- mg
- mk
- ml
- mr
- ms
- mt
- my
- nb
- ne
- nl
- nso
- ny
- pa
- pl
- ps
- pt
- ro
- ru
- sd
- si
- sk
- sl
- sn
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- uk
- ur
- vi
- wo
- xh
- yo
- zh
- zu
library_name: sentence-transformers
datasets:
- BidirLM/BidirLM-Omni-Contrastive
---
# BidirLM-Omni-2.5B
BidirLM-Omni is the omnimodal variant of the BidirLM family: a 2.5B-parameter bidirectional encoder that jointly embeds **text, images, and audio** into a shared representation space, enabling **state-of-the-art** embedding performance.

> [!WARNING]
> This model should be run with **cuDNN > 9.20.0**. Earlier versions trigger a [Conv3D NVIDIA bug](https://forums.developer.nvidia.com/t/cudnn-bug-report-conv3d-performance-regression-with-bfloat16-float16-on-h100/355210) that significantly slows down inference or training.
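
One way to check the installed cuDNN build before running the model is sketched below. It assumes PyTorch is the backend and relies on `torch.backends.cudnn.version()`, which reports the version as a single integer. The helper names (`decode_cudnn_version`, `cudnn_is_recent_enough`, `check_installed_cudnn`) are illustrative and are not part of this repository.

```python
def decode_cudnn_version(raw: int) -> tuple[int, int, int]:
    """Split the integer reported by cuDNN into (major, minor, patch).

    cuDNN 9+ encodes versions as major*10000 + minor*100 + patch (e.g. 92000),
    while cuDNN 8 and earlier use major*1000 + minor*100 + patch (e.g. 8902).
    """
    if raw >= 10000:
        major, rest = divmod(raw, 10000)
    else:
        major, rest = divmod(raw, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


# The warning above asks for a version strictly greater than 9.20.0.
MIN_EXCLUSIVE = (9, 20, 0)


def cudnn_is_recent_enough(raw: int) -> bool:
    """Return True when the decoded version is newer than 9.20.0."""
    return decode_cudnn_version(raw) > MIN_EXCLUSIVE


def check_installed_cudnn() -> None:
    """Print the cuDNN version PyTorch was loaded with, if any."""
    import torch  # imported lazily so the helpers above work without PyTorch

    raw = torch.backends.cudnn.version()  # None when cuDNN is unavailable
    if raw is None:
        print("cuDNN is not available in this PyTorch build.")
        return
    major, minor, patch = decode_cudnn_version(raw)
    status = "OK" if cudnn_is_recent_enough(raw) else "older than recommended"
    print(f"cuDNN {major}.{minor}.{patch}: {status}")
```

Calling `check_installed_cudnn()` on the target machine prints the detected version and whether it satisfies the recommendation above.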
## Supported Tasks
**Multimodal embeddings** (via Sentence Transformers): cross-modal retrieval (text ↔ image, text ↔ audio), multimodal semantic similarity, clustering, and classification across text, image, and audio modalities.
**Text-only downstream fine-tuning** (via Transformers): sequence classification (e.g. MNLI, XNLI), token classification (e.g. NER), sequence regression.
**Supported Languages**: multilingual support covering over 119 languages, inherited from the Qwen3 base model and reinforced through contrastive training on 87 languages.
## Usage
### Sentence Transformers
Pass inputs directly to `encode()`. All modalities produce embeddings in the same 2048-dimensional space and can be compared cross-modally.
| Modality | Input type | Notes |
|----------|-----------|-------|
| **Text** | `str` | Any language; no length cap other than the model's 32k-token context |
| **Image** | `PIL.Image.Image` | Any size and aspect ratio; resized internally |
| **Audio** | `np.ndarray`, `list[float]`, or `dict` with `"array"` (`np.ndarray`) and `"sampling_rate"` (`int`) | Any sample rate; resampled to 16 kHz internally via `librosa` |
| **Mixed** | `list[dict]` conversation (role/content) | Interleave text + image or text + audio in a single prompt; see the `conversation` example in the Transformers section below |
```python
import numpy as np
import PIL.Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BidirLM/BidirLM-Omni-2.5B-Embedding", trust_remote_code=True)

# Text queries
texts = [
    "An image with a red background.",
    "An image with a blue background.",
    "A deep bass sound.",
    "A high-pitched sound.",
]

# Images: synthetic solid-color 256x256 images
images = [
    PIL.Image.fromarray(np.full((256, 256, 3), (220, 30, 30), dtype=np.uint8)),  # red
    PIL.Image.fromarray(np.full((256, 256, 3), (30, 30, 220), dtype=np.uint8)),  # blue
]

# Audio: synthetic sine waves at 16 kHz, 2 seconds each
sr = 16000
t = np.linspace(0, 2.0, sr * 2, endpoint=False, dtype=np.float32)
audios = [
    {"array": np.sin(2 * np.pi * 80 * t), "sampling_rate": sr},    # 80 Hz: bass
    {"array": np.sin(2 * np.pi * 7500 * t), "sampling_rate": sr},  # 7500 Hz: high
]

# Encode all modalities and compute similarities
text_embeddings = model.encode(texts)
image_embeddings = model.encode(images)
audio_embeddings = model.encode(audios)

# Pass a custom instruction via prompt= (applies to all items in the batch)
# text_embeddings = model.encode(texts, prompt="Retrieve semantically similar text.")

print(model.similarity(text_embeddings, image_embeddings))
print(model.similarity(text_embeddings, audio_embeddings))

# Text-Image similarity                 red img  blue img
# "An image with a red background."   [ 0.6928,  0.3103]  ← high red match
# "An image with a blue background."  [ 0.4278,  0.6436]  ← high blue match
# "A deep bass sound."                [ 0.1519,  0.2272]  ← low (text/image mismatch)
# "A high-pitched sound."             [ 0.1418,  0.1812]  ← low (text/image mismatch)

# Text-Audio similarity                 80Hz bass  7500Hz high
# "An image with a red background."   [ 0.0010,    0.0410]  ← low (image/audio mismatch)
# "An image with a blue background."  [ 0.0526,    0.0642]  ← low (image/audio mismatch)
# "A deep bass sound."                [ 0.5456,    0.4243]  ← higher bass match
# "A high-pitched sound."             [ 0.4004,    0.5177]  ← higher high-pitch match
```
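Once a similarity matrix is available, cross-modal retrieval reduces to an argmax (or a full ranking) over it. The sketch below copies the text-image scores printed above into a plain NumPy array so that it runs without loading the model; with real outputs, the result of `model.similarity(...)` converted to a NumPy array would take its place.

```python
import numpy as np

texts = [
    "An image with a red background.",
    "An image with a blue background.",
    "A deep bass sound.",
    "A high-pitched sound.",
]
image_names = ["red image", "blue image"]

# Rows are texts, columns are images; values copied from the printed output above.
text_image_sim = np.array([
    [0.6928, 0.3103],
    [0.4278, 0.6436],
    [0.1519, 0.2272],
    [0.1418, 0.1812],
])

# Image -> text retrieval: for each image (column), take the best-scoring text (row).
best_text_per_image = text_image_sim.argmax(axis=0)

# Full ranking of texts per image, best first.
ranking = np.argsort(-text_image_sim, axis=0)

for col, name in enumerate(image_names):
    best = best_text_per_image[col]
    print(f"{name}: best caption = {texts[best]!r}")
```

Swapping `axis=0` for `axis=1` gives the opposite direction (text to image retrieval).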
### Transformers - Fine-tuning for Downstream Tasks
```python
import numpy as np
import PIL.Image
from transformers import AutoProcessor, AutoModelForSequenceClassification, AutoModelForTokenClassification

processor = AutoProcessor.from_pretrained(
    "BidirLM/BidirLM-Omni-2.5B-Embedding", trust_remote_code=True
)

# Mixed-modality conversation with placeholder image, audio, and text content
sr = 16000
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": PIL.Image.fromarray(np.zeros((256, 256, 3), dtype=np.uint8))},
            {"type": "audio", "audio": {"array": np.zeros(sr, dtype=np.float32), "sampling_rate": sr}},
            {"type": "text", "text": "Your text."},
        ],
    }
]

# Apply the chat template and tokenize; keep the result so it can be fed to a model
inputs = processor.apply_chat_template(conversation, tokenize=True, add_generation_prompt=False)

# Sequence classification (e.g., NLI)
seq_model = AutoModelForSequenceClassification.from_pretrained(
    "BidirLM/BidirLM-Omni-2.5B-Embedding",
    trust_remote_code=True,
    num_labels=3,
)

# Token classification (e.g., NER)
tok_model = AutoModelForTokenClassification.from_pretrained(
    "BidirLM/BidirLM-Omni-2.5B-Embedding",
    trust_remote_code=True,
    num_labels=7,
)
```
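For token classification, word-level labels usually have to be aligned with subword tokens before training. The sketch below shows one common convention: label the first subword of each word and mask the rest with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. The 7-tag BIO label set and the `word_ids` list are hypothetical. In practice `word_ids` would come from a fast tokenizer's `BatchEncoding.word_ids()`, assuming the tokenizer shipped with this model exposes one.

```python
# Hypothetical 7-label BIO scheme, chosen only to match num_labels=7 above.
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def align_labels(word_labels, word_ids):
    """Map one label per word onto one label per token.

    word_labels: list of tag strings, one per word
    word_ids:    for each token, the index of its source word, or None for
                 special tokens
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # special token
        elif word_id != previous:
            aligned.append(label2id[word_labels[word_id]])  # first subword
        else:
            aligned.append(IGNORE_INDEX)  # continuation subword
        previous = word_id
    return aligned


# Toy example: 4 words, where words 1 and 3 are split into two subwords each.
word_labels = ["B-PER", "I-PER", "O", "B-LOC"]
word_ids = [None, 0, 1, 1, 2, 3, 3, None]
print(align_labels(word_labels, word_ids))
# [-100, 1, 2, -100, 0, 5, -100, -100]
```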
## Requirements
```
transformers>=5.5.0
sentence-transformers>=5.4.0
librosa>=0.10.0
```
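A small script can confirm that the installed packages meet these minimums. The sketch below uses only the standard library (`importlib.metadata`); the function names and the simplified version parser are illustrative, and `packaging.version.Version` would be a more robust choice for pre-release or local version strings.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the requirements listed above.
MINIMUMS = {
    "transformers": "5.5.0",
    "sentence-transformers": "5.4.0",
    "librosa": "0.10.0",
}


def parse(v: str) -> tuple[int, ...]:
    """Keep the leading purely numeric components, e.g. '0.10.2.post1' -> (0, 10, 2)."""
    parts = []
    for piece in v.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare versions numerically so that 0.10.0 sorts above 0.9.2."""
    return parse(installed) >= parse(minimum)


def check_requirements() -> dict[str, str]:
    """Return a status string per required package."""
    status = {}
    for name, minimum in MINIMUMS.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            status[name] = "missing"
            continue
        if meets_minimum(installed, minimum):
            status[name] = f"ok ({installed})"
        else:
            status[name] = f"too old ({installed} < {minimum})"
    return status


for name, state in check_requirements().items():
    print(f"{name}: {state}")
```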
## FAQ
### 1. What pooling strategy does this model use?
The model uses **mean pooling** across all modalities. This is handled automatically when using Sentence Transformers.
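
For readers unfamiliar with the term, mean pooling averages the per-token vectors of a sequence while ignoring padded positions. The NumPy sketch below illustrates the general technique on a toy batch with hidden size 2; it is not the model's actual pooling code, and the function name `mean_pool` is hypothetical.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, ignoring padded positions.

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


# Toy batch: 2 sequences, 3 positions each, hidden size 2.
token_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
attention_mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

pooled = mean_pool(token_embeddings, attention_mask)
print(pooled)  # rows: [2. 3.] and [4. 2.]
```

The padded vector `[100, 100]` has no effect on the first row, which is the point of applying the mask before averaging.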
### 2. Do I need `trust_remote_code=True`?
Yes. BidirLM-Omni uses a custom bidirectional omnimodal architecture that requires loading custom code from the repository.
### 3. Can I compare embeddings across modalities?
Yes. Text, image, and audio embeddings live in the same 2048-dimensional space and can be compared directly using cosine similarity.
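
As a minimal sketch of that comparison, the snippet below computes a cosine-similarity matrix with NumPy. Random 2048-dimensional vectors stand in for the outputs of `model.encode(...)`, so the scores themselves are meaningless; only the shapes and the computation are illustrated. The function name `cosine_similarity` is defined here and is not a library call.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a (n, d) and rows of b (m, d)."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


rng = np.random.default_rng(0)
text_embeddings = rng.standard_normal((4, 2048))   # stand-in for model.encode(texts)
image_embeddings = rng.standard_normal((2, 2048))  # stand-in for model.encode(images)

similarities = cosine_similarity(text_embeddings, image_embeddings)
print(similarities.shape)  # (4, 2): one row per text, one column per image
```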
### 4. What audio formats and sample rates are supported?
Any sample rate is accepted — the model resamples internally using `librosa` when the source rate differs from the native 16 kHz. Three input formats are supported:
- `np.ndarray` — a 1-D float32 array of raw samples
- `list[float]` — a plain Python list of samples
- `dict` with `"array"` (`np.ndarray`) and `"sampling_rate"` (`int`) — the format returned by HuggingFace `datasets` Audio features
Any audio format readable by standard libraries (WAV, MP3, FLAC, etc.) can be used by loading it into a NumPy array first (e.g. with `librosa.load` or `soundfile.read`).
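
The sketch below shows one way to normalize raw samples into the `dict` form listed above. The helper `to_audio_input` is hypothetical, and it assumes mono input; multi-channel audio would need to be downmixed first. A synthetic 44.1 kHz tone stands in for samples loaded from a file.

```python
import numpy as np


def to_audio_input(samples, sampling_rate: int) -> dict:
    """Pack raw mono samples into {"array": float32 ndarray, "sampling_rate": int}."""
    array = np.asarray(samples, dtype=np.float32)
    if array.ndim != 1:
        raise ValueError(f"expected 1-D mono samples, got shape {array.shape}")
    return {"array": array, "sampling_rate": int(sampling_rate)}


# One second of a 440 Hz tone at 44.1 kHz, provided as a plain Python list.
source_rate = 44100
t = np.arange(source_rate) / source_rate
tone = np.sin(2 * np.pi * 440 * t).tolist()

audio = to_audio_input(tone, source_rate)
print(audio["array"].dtype, audio["array"].shape, audio["sampling_rate"])
```

The resulting dictionary can be placed in the list passed to `model.encode(...)`, where, as described above, resampling to 16 kHz happens internally.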
## Citation
```bibtex
@misc{boizard2026bidirlmtextomnimodalbidirectional,
title={BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs},
author={Nicolas Boizard and Théo Deschamps-Berger and Hippolyte Gisserot-Boukhlef and Céline Hudelot and Pierre Colombo},
year={2026},
eprint={2604.02045},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2604.02045},
}
```