---
tags:
- mteb
- sentence-transformers
- transformers
- embedding
- bidirectional
- multilingual
pipeline_tag: sentence-similarity
license: apache-2.0
base_model: BidirLM/BidirLM-Omni-2.5B-Embedding
language:
- multilingual
- af
- am
- ar
- az
- be
- bg
- bn
- bs
- ca
- ceb
- cs
- cy
- da
- de
- el
- en
- es
- et
- eu
- fa
- fi
- fr
- ga
- gl
- gu
- ha
- he
- hi
- hr
- ht
- hu
- hy
- id
- ig
- is
- it
- ja
- jv
- ka
- kk
- kn
- ko
- ky
- lt
- lv
- mg
- mk
- ml
- mr
- ms
- mt
- my
- nb
- ne
- nl
- nso
- ny
- pa
- pl
- ps
- pt
- ro
- ru
- sd
- si
- sk
- sl
- sn
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- uk
- ur
- vi
- wo
- xh
- yo
- zh
- zu
library_name: sentence-transformers
datasets:
- BidirLM/BidirLM-Omni-Contrastive
---

# BidirLM-Omni-2.5B

BidirLM-Omni is the omnimodal variant of the BidirLM family: a 2.5B-parameter bidirectional encoder that jointly embeds **text, images, and audio** into a shared representation space, with **state-of-the-art** embedding performance on the benchmarks shown below.

![Omnimodal model performance: MTEB Multilingual V2, MIEB (lite), MAEB (beta)](https://huggingface.co/spaces/BidirLM/README/resolve/main/fig6.png)

> [!WARNING]
> This model should be run with **cuDNN > 9.20.0**. Earlier versions trigger a [Conv3D NVIDIA bug](https://forums.developer.nvidia.com/t/cudnn-bug-report-conv3d-performance-regression-with-bfloat16-float16-on-h100/355210) that significantly slows down inference or training.

## Supported Tasks

**Multimodal embeddings** (via Sentence Transformers): cross-modal retrieval (text ↔ image, text ↔ audio), multimodal semantic similarity, clustering, and classification across text, image, and audio modalities.

**Text-only downstream fine-tuning** (via Transformers): sequence classification (e.g. MNLI, XNLI), token classification (e.g. NER), sequence regression.

**Supported languages**: multilingual support for over 119 languages, inherited from the Qwen3 base model and reinforced through contrastive training on 87 languages.

## Usage

### Sentence Transformers

Pass inputs directly to `encode()`. All modalities produce embeddings in the same 2048-dimensional space and can be compared cross-modally.

| Modality | Input type | Notes |
|----------|-----------|-------|
| **Text** | `str` | Any language; no separate length cap beyond the model's 32k-token context |
| **Image** | `PIL.Image.Image` | Any size and aspect ratio; resized internally |
| **Audio** | `np.ndarray`, `list[float]`, or `dict` with `"array"` (`np.ndarray`) and `"sampling_rate"` (`int`) | Any sample rate; resampled to 16 kHz internally via `librosa` |
| **Mixed** | `list[dict]` conversation (role/content) | Interleave text + image or text + audio in a single prompt — see *Chat Template* below |

```python
import numpy as np
import PIL.Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BidirLM/BidirLM-Omni-2.5B-Embedding", trust_remote_code=True)

# Text queries
texts = [
    "An image with a red background.",
    "An image with a blue background.",
    "A deep bass sound.",
    "A high-pitched sound.",
]

# Images, synthetic solid-color 256x256 images
images = [
    PIL.Image.fromarray(np.full((256, 256, 3), (220, 30, 30), dtype=np.uint8)),  # red
    PIL.Image.fromarray(np.full((256, 256, 3), (30, 30, 220), dtype=np.uint8)),  # blue
]

# Audio, synthetic sine waves at 16kHz, 2 seconds each
sr = 16000
t  = np.linspace(0, 2.0, sr * 2, endpoint=False, dtype=np.float32)
audios = [
    {"array": np.sin(2 * np.pi *   80 * t), "sampling_rate": sr},  #   80 Hz — bass
    {"array": np.sin(2 * np.pi * 7500 * t), "sampling_rate": sr},  # 7500 Hz — high
]

# Encode all modalities and compute similarities
text_embeddings  = model.encode(texts)
image_embeddings = model.encode(images)
audio_embeddings = model.encode(audios)

# Pass a custom instruction via prompt= (applies to all items in the batch)
# text_embeddings  = model.encode(texts, prompt="Retrieve semantically similar text.")

print(model.similarity(text_embeddings, image_embeddings))
print(model.similarity(text_embeddings, audio_embeddings))

# Text-Image similarity             red img   blue img
# "An image with a red background." [ 0.6928,   0.3103]  ← high red match
# "An image with a blue background."[ 0.4278,   0.6436]  ← high blue match
# "A deep bass sound."              [ 0.1519,   0.2272]  ← low (text/image mismatch)
# "A high-pitched sound."           [ 0.1418,   0.1812]  ← low (text/image mismatch)

# Text-Audio similarity             80Hz bass  7500Hz high
# "An image with a red background." [ 0.0010,   0.0410]  ← low (image/audio mismatch)
# "An image with a blue background."[ 0.0526,   0.0642]  ← low (image/audio mismatch)
# "A deep bass sound."              [ 0.5456,   0.4243]  ← higher bass match
# "A high-pitched sound."           [ 0.4004,   0.5177]  ← higher high-pitch match
```
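
Each printed matrix can be read as a table of retrieval scores, with one row per text query and one column per candidate. The sketch below copies the text-image scores from the comments above by hand (they are not recomputed here) and picks the best candidate per query; `best_match` is a hypothetical helper, not part of Sentence Transformers.

```python
# Text-image similarity scores copied from the example output above.
text_image_scores = [
    [0.6928, 0.3103],  # "An image with a red background."
    [0.4278, 0.6436],  # "An image with a blue background."
    [0.1519, 0.2272],  # "A deep bass sound."
    [0.1418, 0.1812],  # "A high-pitched sound."
]
image_labels = ["red img", "blue img"]


def best_match(scores_row, labels):
    """Return (label, score) of the highest-scoring candidate for one query."""
    index = max(range(len(scores_row)), key=lambda i: scores_row[i])
    return labels[index], scores_row[index]


# Only the first two queries actually describe images.
for row in text_image_scores[:2]:
    print(best_match(row, image_labels))
```

On the tensor returned by `model.similarity`, calling `.argmax(dim=1)` should give the same per-row indices without the helper.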


### Transformers - Fine-tuning for Downstream Tasks

```python
import numpy as np
import PIL.Image
from transformers import AutoProcessor, AutoModelForSequenceClassification, AutoModelForTokenClassification

processor = AutoProcessor.from_pretrained(
    "BidirLM/BidirLM-Omni-2.5B-Embedding", trust_remote_code=True
)

sr = 16000
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": PIL.Image.fromarray(np.zeros((256, 256, 3), dtype=np.uint8))},
            {"type": "audio", "audio": {"array": np.zeros(sr, dtype=np.float32), "sampling_rate": sr}},
            {"type": "text",  "text": "Your text."},
        ],
    }
]
# Tokenize the mixed-modality conversation into model inputs
inputs = processor.apply_chat_template(conversation, tokenize=True, add_generation_prompt=False)


# Sequence classification (e.g., NLI)
seq_model = AutoModelForSequenceClassification.from_pretrained(
    "BidirLM/BidirLM-Omni-2.5B-Embedding",
    trust_remote_code=True,
    num_labels=3,
)

# Token classification (e.g., NER)
tok_model = AutoModelForTokenClassification.from_pretrained(
    "BidirLM/BidirLM-Omni-2.5B-Embedding",
    trust_remote_code=True,
    num_labels=7,
)
```

## Requirements

```
transformers>=5.5.0
sentence-transformers>=5.4.0
librosa>=0.10.0
```
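
One way to check an environment against these minimums is sketched below, using only the standard library. The helper names `release_tuple` and `meets_minimum` are hypothetical, and the comparison deliberately ignores pre-release suffixes, so treat it as a quick sanity check rather than a full version resolver.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the requirements list above.
MINIMUMS = {
    "transformers": "5.5.0",
    "sentence-transformers": "5.4.0",
    "librosa": "0.10.0",
}


def release_tuple(version_string):
    """Parse the leading numeric components, e.g. '5.5.0.dev0' -> (5, 5, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"unparseable version: {version_string!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """Compare zero-padded release tuples; pre-release suffixes are ignored."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


for package, minimum in MINIMUMS.items():
    try:
        installed = version(package)
    except PackageNotFoundError:
        print(f"{package}: not installed (need >= {minimum})")
        continue
    status = "ok" if meets_minimum(installed, minimum) else f"too old, need >= {minimum}"
    print(f"{package}: {installed} ({status})")
```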

## FAQ

### 1. What pooling strategy does this model use?

The model uses **mean pooling** across all modalities. This is handled automatically when using Sentence Transformers.
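
To illustrate what mean pooling computes, here is a minimal pure-Python sketch that averages token-level vectors into a single vector while skipping padding positions. Masking by the attention mask is the common convention and is an assumption here; the function name `mean_pool` and the toy values are hypothetical, and the model's own code may differ in detail.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1; padding (0) is skipped."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy sequence: three real tokens followed by one padding position.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

pooled = mean_pool(hidden_states, mask)
print(pooled)  # the padding vector [9.0, 9.0] does not influence the result
```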

### 2. Do I need `trust_remote_code=True`?

Yes. BidirLM-Omni uses a custom bidirectional omnimodal architecture that requires loading custom code from the repository.

### 3. Can I compare embeddings across modalities?

Yes. Text, image, and audio embeddings live in the same 2048-dimensional space and can be compared directly using cosine similarity.
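
For reference, cosine similarity only depends on the direction of the two vectors, not their length. The sketch below uses toy 3-dimensional vectors as stand-ins for real 2048-dimensional embeddings; `cosine_similarity` is a hypothetical helper written for illustration, and in practice `model.similarity` does the comparison for you.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy stand-ins for one text embedding and two image embeddings.
text_vec = [1.0, 0.0, 1.0]
image_close = [2.0, 0.0, 2.0]  # same direction, different length
image_far = [0.0, 1.0, 0.0]    # orthogonal to the text vector

print(cosine_similarity(text_vec, image_close))
print(cosine_similarity(text_vec, image_far))
```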

### 4. What audio formats and sample rates are supported?

Any sample rate is accepted — the model resamples internally using `librosa` when the source rate differs from the native 16 kHz. Three input formats are supported:

- `np.ndarray` — a 1-D float32 array of raw samples
- `list[float]` — a plain Python list of samples
- `dict` with `"array"` (`np.ndarray`) and `"sampling_rate"` (`int`) — the format returned by HuggingFace `datasets` Audio features

Any audio format readable by standard libraries (WAV, MP3, FLAC, etc.) can be used by loading it into a NumPy array first (e.g. with `librosa.load` or `soundfile.read`).
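
If neither `librosa` nor `soundfile` is at hand, a 16-bit mono WAV file can also be decoded with the standard library alone. The sketch below is one possible approach under that narrow assumption (16-bit mono PCM only); `read_wav_mono16` is a hypothetical helper, and the demo writes a short synthetic tone to a temporary file so it runs without any external audio.

```python
import math
import os
import struct
import tempfile
import wave


def read_wav_mono16(path):
    """Read a 16-bit mono PCM WAV file; return (samples in [-1, 1], sample rate)."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit mono PCM")
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    # WAV sample data is little-endian signed 16-bit.
    ints = struct.unpack(f"<{len(raw) // 2}h", raw)
    return [value / 32768.0 for value in ints], rate


# Demo: write a 0.1 s, 440 Hz tone at 8 kHz, then read it back.
rate = 8000
tone = [int(32767 * 0.5 * math.sin(2 * math.pi * 440 * n / rate)) for n in range(rate // 10)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tone.wav")
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(struct.pack(f"<{len(tone)}h", *tone))
    samples, sampling_rate = read_wav_mono16(path)

print(len(samples), sampling_rate)
```

The result can then be wrapped in the dict format listed above, for example `{"array": np.asarray(samples, dtype=np.float32), "sampling_rate": sampling_rate}`, so that the source sample rate travels with the samples.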

## Citation

```bibtex
@misc{boizard2026bidirlmtextomnimodalbidirectional,
      title={BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs}, 
      author={Nicolas Boizard and Théo Deschamps-Berger and Hippolyte Gisserot-Boukhlef and Céline Hudelot and Pierre Colombo},
      year={2026},
      eprint={2604.02045},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.02045}, 
}
```