---
language:
- en
base_model:
- facebook/metaclip-b16-400m
tags:
- medical
- biology
---

## Model Description

This model is a CLIP-style vision–language model trained on a 20% subset of the BIOMEDICA dataset selected with CLIPScore.

**Technical Specifications:**

* **Base model:** `facebook/metaclip-b16-400m` (CLIP-like architecture)
* **Architecture:** `CLIPModel` from the `transformers` library
* **Processor:** `CLIPProcessor` (handles both image and text preprocessing)

**Example Usage**

```python
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import requests
import torch

model_id = "Mihara-bot/metaclip-b16-400m-biomedica_CLIPScore_20"

processor = CLIPProcessor.from_pretrained(model_id)
model = CLIPModel.from_pretrained(model_id)

# Example image & text (replace the placeholder URL with a real image URL)
url = "https://your-image-url"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
texts = ["a medical image of ...", "a normal image of ..."]

# Tokenize the captions and preprocess the image in one call
inputs = processor(
    text=texts,
    images=image,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=77,
)

with torch.no_grad():
    outputs = model(
        pixel_values=inputs["pixel_values"],
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
    )

image_embeds = outputs.image_embeds  # (num_images, dim), L2-normalized
text_embeds = outputs.text_embeds    # (num_texts, dim), L2-normalized

# Calculate similarity: use the model's logits, which are cosine similarities
# multiplied by the learned logit scale. A softmax over unscaled cosine
# similarities would be close to uniform.
logits_per_image = outputs.logits_per_image  # (num_images, num_texts)
probs = logits_per_image.softmax(dim=-1)
print(probs)
```

**Intended Use**

- Vision–language tasks such as image–text retrieval, zero-shot classification, or image–text similarity in the biomedical/medical domain (depending on the specific dataset subset used).
- Research on data selection, influence functions, and the efficient adaptation of CLIP models.

**Not Intended For**

- Any safety-critical clinical diagnosis or automated medical decision-making.
- Any deployment without human oversight, especially within healthcare environments.

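For the zero-shot classification use listed under Intended Use, the scale applied to the similarities before the softmax strongly affects the resulting probabilities. CLIP-style models multiply cosine similarities by a learned logit scale. The sketch below is only an illustration in plain Python: the helper name `scaled_softmax`, the similarity values, and the scale of `100.0` are all assumptions made for the example, not values taken from this checkpoint.

```python
import math


def scaled_softmax(similarities, logit_scale):
    """Softmax over similarities multiplied by a temperature-like scale."""
    scaled = [logit_scale * s for s in similarities]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up cosine similarities between one image and two candidate captions.
cosine_sims = [0.30, 0.25]

# Without scaling, the small gap between similarities barely moves the softmax.
raw_probs = scaled_softmax(cosine_sims, logit_scale=1.0)

# With an illustrative scale of 100.0, the same gap yields a confident prediction.
scaled_probs = scaled_softmax(cosine_sims, logit_scale=100.0)

print("unscaled:", raw_probs)
print("scaled:  ", scaled_probs)
```

With `transformers`, the learned scale of a loaded `CLIPModel` can be read as `model.logit_scale.exp()`, and `outputs.logits_per_image` already has it applied.
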
**Limitations**

- The model is trained on selected subsets of the BIOMEDICA dataset; it may reflect the biases and coverage limitations of the underlying dataset.
- Performance outside the target domain (e.g., general web images) is likely weaker than generic CLIP models.
- Training text largely consists of short captions; performance on long, structured clinical narratives may be limited.

## Citation

If you find this model useful, please cite the CHIPS paper:

```
@misc{zhuang2025chipsefficientclipadaptation,
      title={CHIPS: Efficient CLIP Adaptation via Curvature-aware Hybrid Influence-based Data Selection},
      author={Xinlin Zhuang and Yichen Li and Xiwei Liu and Haolin Yang and Yifan Lu and Ziyun Zou and Yulong Li and Huifa Li and Dongliang Chen and Qinglei Wang and Weiyang Liu and Ying Qian and Jiangming Shi and Imran Razzak},
      year={2025},
      eprint={2511.18519},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2511.18519},
}
```