---
license: mit
tags:
- open_clip
- bioacoustics
- multimodal
- zero-shot-retrieval
---

# BioVITA

**BioVITA** is a 3-modal (Audio × Image × Text) representation learning model for wildlife species recognition, trained on the BioVITA dataset.

- Image / Text encoder: ViT-L/14 fine-tuned from [BioCLIP-2](https://huggingface.co/imageomics/bioclip-2)
- Audio encoder: [CLAP (HTSAT-unfused)](https://huggingface.co/laion/clap-htsat-unfused) fine-tuned with a linear projection adapter

## Files

| File | Description |
|------|-------------|
| `open_clip_pytorch_model.bin` | Image & text encoder weights (OpenCLIP ViT-L/14) |
| `open_clip_config.json` | OpenCLIP model config |
| `clap_weights.pth` | Audio encoder (CLAP) + adapter weights |
| `tokenizer*.json` / `vocab.json` / `merges.txt` | Tokenizer files |

## Usage

With the [BioVITA release code](https://github.com/dahlian00/BioVITA):

```bash
# Extract features (image + text + audio)
torchrun --nproc_per_node=8 eval/extract_features.py \
    --ids_dir path/to/benchmark/ids \
    --feat_root path/to/output \
    --tag biovita \
    --vita_model_id risashinoda/BioVITA \
    --modalities audio,image,text

# Evaluate on BioVITA benchmark
python eval/eval_benchmark.py \
    --base_dir path/to/benchmark \
    --ids_dir path/to/benchmark/ids \
    --feat_root path/to/output \
    --tag biovita
```

## Citation

```bibtex
@inproceedings{shinoda2026biovita,
  title     = {BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment},
  author    = {Risa Shinoda and Kaede Shiohara and Nakamasa Inoue and Kuniaki Saito and Hiroaki Santo and Fumio Okura},
  booktitle = {CVPR},
  year      = {2026},
}
```
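
## Example: cross-modal retrieval on extracted features

As a supplement to the Usage section, the sketch below illustrates the kind of zero-shot retrieval that embeddings from a shared audio/image/text space make possible: L2-normalize both sets of features, compute cosine similarities, and rank the gallery for each query. It is a minimal illustration, not the benchmark's evaluation code. The on-disk layout of the features written by `eval/extract_features.py` is not described in this card, so random arrays stand in for real embeddings, the embedding size of 64 is arbitrary, and `l2_normalize` and `retrieve_top_k` are hypothetical helper names that are not part of the BioVITA release code.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm (eps guards against zero vectors)."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def retrieve_top_k(query_feats, gallery_feats, k=5):
    """Return indices and cosine similarities of the top-k gallery items per query.

    Hypothetical helper: assumes both inputs are 2-D arrays of shape
    (num_items, dim) that live in the same embedding space.
    """
    q = l2_normalize(np.asarray(query_feats, dtype=np.float64))
    g = l2_normalize(np.asarray(gallery_feats, dtype=np.float64))
    sims = q @ g.T                              # (num_queries, num_gallery)
    order = np.argsort(-sims, axis=1)[:, :k]    # highest similarity first
    return order, np.take_along_axis(sims, order, axis=1)


# Stand-in data (assumption): four "species" text embeddings, plus audio
# embeddings that are noisy copies of them. Real features would come from
# the extraction step instead.
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(4, 64))
audio_feats = text_feats + 0.1 * rng.normal(size=(4, 64))

# Audio-to-text retrieval: which text embedding is closest to each audio clip?
top_idx, top_sim = retrieve_top_k(audio_feats, text_feats, k=1)
print(top_idx.ravel())
```

Swapping the query and gallery arguments gives the reverse direction (for example text-to-audio), and the same function applies to any pair of the three modalities.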