---
tags:
- mteb
- sentence-transformers
- transformers
- embedding
- bidirectional
- multilingual
pipeline_tag: sentence-similarity
license: apache-2.0
base_model: BidirLM/BidirLM-Omni-2.5B-Embedding
language:
- multilingual
- af
- am
- ar
- az
- be
- bg
- bn
- bs
- ca
- ceb
- cs
- cy
- da
- de
- el
- en
- es
- et
- eu
- fa
- fi
- fr
- ga
- gl
- gu
- ha
- he
- hi
- hr
- ht
- hu
- hy
- id
- ig
- is
- it
- ja
- jv
- ka
- kk
- kn
- ko
- ky
- lt
- lv
- mg
- mk
- ml
- mr
- ms
- mt
- my
- nb
- ne
- nl
- nso
- ny
- pa
- pl
- ps
- pt
- ro
- ru
- sd
- si
- sk
- sl
- sn
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- uk
- ur
- vi
- wo
- xh
- yo
- zh
- zu
library_name: sentence-transformers
datasets:
- BidirLM/BidirLM-Omni-Contrastive
---

# BidirLM-Omni-2.5B

BidirLM-Omni is the omnimodal variant of the BidirLM family: a 2.5B-parameter bidirectional encoder that jointly embeds **text, images, and audio** into a shared representation space, with **state-of-the-art** embedding performance on the benchmarks shown below.

![Omnimodal model performance: MTEB Multilingual V2, MIEB (lite), MAEB (beta)](https://huggingface.co/spaces/BidirLM/README/resolve/main/fig6.png)

> [!WARNING]
> This model should be run with **cuDNN > 9.20.0**. Earlier versions trigger a [Conv3D NVIDIA bug](https://forums.developer.nvidia.com/t/cudnn-bug-report-conv3d-performance-regression-with-bfloat16-float16-on-h100/355210) that significantly slows down inference or training.

## Supported Tasks

**Multimodal embeddings** (via Sentence Transformers): cross-modal retrieval (text ↔ image, text ↔ audio), multimodal semantic similarity, clustering, and classification across text, image, and audio modalities.

**Text-only downstream fine-tuning** (via Transformers): sequence classification (e.g. MNLI, XNLI), token classification (e.g. NER), and sequence regression.

**Supported Languages**

Support for over 119 languages, inherited from the Qwen3 base model and reinforced through contrastive training on 87 languages.
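The cuDNN requirement in the warning above can be checked before loading the model. The following is a minimal sketch, not part of the model's own tooling: the helper names `decode_cudnn_version` and `cudnn_is_recent_enough` are hypothetical, and the check assumes that PyTorch is the backend and that the bound in the warning is strict (versions above 9.20.0). It decodes the integer reported by `torch.backends.cudnn.version()`.

```python
# Hypothetical helpers for illustration; not shipped with the model.
MIN_EXCLUSIVE = (9, 20, 0)  # assumption: the warning's "> 9.20.0" is a strict bound


def decode_cudnn_version(version: int) -> tuple:
    """Decode the integer cuDNN version into (major, minor, patch)."""
    if version >= 10000:
        # cuDNN 9 and later: major * 10000 + minor * 100 + patch
        major, rest = divmod(version, 10000)
    else:
        # cuDNN 8 and earlier: major * 1000 + minor * 100 + patch
        major, rest = divmod(version, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def cudnn_is_recent_enough(version: int) -> bool:
    """Return True if the decoded version is strictly above MIN_EXCLUSIVE."""
    return decode_cudnn_version(version) > MIN_EXCLUSIVE


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; cannot query cuDNN.")
    else:
        raw = torch.backends.cudnn.version()  # None when cuDNN is unavailable
        if raw is None:
            print("cuDNN is not available in this PyTorch build.")
        else:
            status = "OK" if cudnn_is_recent_enough(raw) else "older than recommended"
            print("cuDNN", decode_cudnn_version(raw), status)
```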
## Usage

### Sentence Transformers

Pass inputs directly to `encode()`. All modalities produce embeddings in the same 2048-dimensional space and can be compared cross-modally.

| Modality | Input type | Notes |
|----------|-----------|-------|
| **Text** | `str` | Any language; inputs up to the model's 32k-token context |
| **Image** | `PIL.Image.Image` | Any size and aspect ratio; resized internally |
| **Audio** | `np.ndarray`, `list[float]`, or `dict` with `"array"` (`np.ndarray`) and `"sampling_rate"` (`int`) | Any sample rate; resampled to 16 kHz internally via `librosa` |
| **Mixed** | `list[dict]` conversation (role/content) | Interleave text + image or text + audio in a single prompt; see the conversation format in the Transformers example below |

```python
import numpy as np
import PIL.Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BidirLM/BidirLM-Omni-2.5B-Embedding", trust_remote_code=True)

# Text queries
texts = [
    "An image with a red background.",
    "An image with a blue background.",
    "A deep bass sound.",
    "A high-pitched sound.",
]

# Images: synthetic solid-color 256x256 images
images = [
    PIL.Image.fromarray(np.full((256, 256, 3), (220, 30, 30), dtype=np.uint8)),  # red
    PIL.Image.fromarray(np.full((256, 256, 3), (30, 30, 220), dtype=np.uint8)),  # blue
]

# Audio: synthetic sine waves at 16 kHz, 2 seconds each
sr = 16000
t = np.linspace(0, 2.0, sr * 2, endpoint=False, dtype=np.float32)
audios = [
    {"array": np.sin(2 * np.pi * 80 * t), "sampling_rate": sr},    # 80 Hz: bass
    {"array": np.sin(2 * np.pi * 7500 * t), "sampling_rate": sr},  # 7500 Hz: high
]

# Encode all modalities and compute similarities
text_embeddings = model.encode(texts)
image_embeddings = model.encode(images)
audio_embeddings = model.encode(audios)

# Pass a custom instruction via prompt= (applies to all items in the batch)
# text_embeddings = model.encode(texts, prompt="Retrieve semantically similar text.")

print(model.similarity(text_embeddings, image_embeddings))
print(model.similarity(text_embeddings, audio_embeddings))

# Text-Image similarity                red img  blue img
# "An image with a red background."  [ 0.6928,  0.3103]  ← high red match
# "An image with a blue background." [ 0.4278,  0.6436]  ← high blue match
# "A deep bass sound."               [ 0.1519,  0.2272]  ← low (text/image mismatch)
# "A high-pitched sound."            [ 0.1418,  0.1812]  ← low (text/image mismatch)

# Text-Audio similarity                80Hz bass  7500Hz high
# "An image with a red background."  [ 0.0010,    0.0410]  ← low (image/audio mismatch)
# "An image with a blue background." [ 0.0526,    0.0642]  ← low (image/audio mismatch)
# "A deep bass sound."               [ 0.5456,    0.4243]  ← higher bass match
# "A high-pitched sound."            [ 0.4004,    0.5177]  ← higher high-pitch match
```

### Transformers - Fine-tuning for Downstream Tasks

```python
import numpy as np
import PIL.Image
from transformers import AutoProcessor, AutoModelForSequenceClassification, AutoModelForTokenClassification

processor = AutoProcessor.from_pretrained(
    "BidirLM/BidirLM-Omni-2.5B-Embedding", trust_remote_code=True
)

# Placeholder inputs: a black 256x256 image and one second of silence at 16 kHz
sr = 16000
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": PIL.Image.fromarray(np.zeros((256, 256, 3), dtype=np.uint8))},
            {"type": "audio", "audio": {"array": np.zeros(sr, dtype=np.float32), "sampling_rate": sr}},
            {"type": "text", "text": "Your text."},
        ],
    }
]

# Tokenize the mixed-modality conversation and keep the result for the model
inputs = processor.apply_chat_template(conversation, tokenize=True, add_generation_prompt=False)

# Sequence classification (e.g., NLI)
seq_model = AutoModelForSequenceClassification.from_pretrained(
    "BidirLM/BidirLM-Omni-2.5B-Embedding",
    trust_remote_code=True,
    num_labels=3,
)

# Token classification (e.g., NER)
tok_model = AutoModelForTokenClassification.from_pretrained(
    "BidirLM/BidirLM-Omni-2.5B-Embedding",
    trust_remote_code=True,
    num_labels=7,
)
```

## Requirements

```
transformers>=5.5.0
sentence-transformers>=5.4.0
librosa>=0.10.0
```

## FAQ

### 1. What pooling strategy does this model use?

The model uses **mean pooling** across all modalities.
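
As a rough illustration of what mean pooling computes, the sketch below averages per-token vectors over non-padding positions. It is a generic NumPy example under stated assumptions, not the model's actual code: the function name `mean_pool`, the use of an attention mask to skip padding, and the toy hidden size of 2 are all chosen for illustration.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings over positions where attention_mask is 1.

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for sequences with no real tokens
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# Toy batch: one sequence with two real tokens and one padding position
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = mean_pool(tokens, mask)
print(pooled)  # [[2. 3.]]
```
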
This is handled automatically when using Sentence Transformers.

### 2. Do I need `trust_remote_code=True`?

Yes. BidirLM-Omni uses a custom bidirectional omnimodal architecture that requires loading custom code from the repository.

### 3. Can I compare embeddings across modalities?

Yes. Text, image, and audio embeddings live in the same 2048-dimensional space and can be compared directly using cosine similarity.

### 4. What audio formats and sample rates are supported?

Any sample rate is accepted: the model resamples internally using `librosa` when the source rate differs from the native 16 kHz. Three input formats are supported:

- `np.ndarray`: a 1-D float32 array of raw samples
- `list[float]`: a plain Python list of samples
- `dict` with `"array"` (`np.ndarray`) and `"sampling_rate"` (`int`): the format returned by Hugging Face `datasets` `Audio` features

Any audio file format readable by standard libraries (WAV, MP3, FLAC, etc.) can be used by loading it into a NumPy array first (e.g. with `librosa.load` or `soundfile.read`).

## Citation

```bibtex
@misc{boizard2026bidirlmtextomnimodalbidirectional,
      title={BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs},
      author={Nicolas Boizard and Théo Deschamps-Berger and Hippolyte Gisserot-Boukhlef and Céline Hudelot and Pierre Colombo},
      year={2026},
      eprint={2604.02045},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.02045},
}
```