---
library_name: open_clip
tags:
- clip
- openclip
- medical
- vision-language
- image-text-retrieval
- medpmc
---

# MedPMC-CLIP

MedPMC-CLIP is a medical vision-language model based on the OpenCLIP `ViT-L-14` architecture. The model was trained on the [MedPMC-11M dataset](https://huggingface.co/datasets/Yale-BIDS-Chen/medpmc-11m-dataset_jun24_baseline), a carefully curated collection of approximately 11 million image-caption pairs derived from biomedical literature.

Across a wide range of evaluations, including zero-shot medical image classification on 26 public benchmarks and zero-shot image retrieval on an internal clinical dermatology dataset, MedPMC-CLIP consistently outperforms existing baseline models. For additional details on model training and benchmark results, please refer to our paper (coming soon).

This repository provides the checkpoint in **OpenCLIP format**. Text inputs should be tokenized using the default OpenCLIP tokenizer for `ViT-L-14`.

```python
import open_clip

tokenizer = open_clip.get_tokenizer("ViT-L-14")
```

## Files

- `open_clip_pytorch_model.safetensors`: OpenCLIP-format model checkpoint
- `inference_example.py`: example code for image-text similarity
- `requirements.txt`: minimal dependencies

## Usage

```python
import torch
import open_clip
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download
from PIL import Image

model_name = "ViT-L-14"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Build the architecture without pretrained weights; they are loaded below.
model, _, preprocess = open_clip.create_model_and_transforms(
    model_name,
    pretrained=None,
)

# Download the checkpoint from the Hugging Face Hub.
repo_id = "Yale-BIDS-Chen/medpmc-clip-l-14_jun24_v1"
ckpt_path = hf_hub_download(
    repo_id=repo_id,
    filename="open_clip_pytorch_model.safetensors",
)

# Load the weights on CPU first, then move the model to the target device.
state_dict = load_file(ckpt_path, device="cpu")
model.load_state_dict(state_dict, strict=True)
model = model.to(device)
model.eval()

tokenizer = open_clip.get_tokenizer(model_name)

# Preprocess one image and tokenize the candidate captions.
image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0).to(device)
text = tokenizer(["fundus photograph", "chest radiograph", "histopathology image"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # L2-normalize so the dot product below is a cosine similarity.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # Shape: (number of images, number of captions).
    similarity = image_features @ text_features.T

print(similarity)
```

## Citation

Citation information will be added upon release.

## Questions?

For questions or feedback, please contact Hyunjae Kim at `hyunjae.kim@yale.edu`.
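
## Example: ranking labels from similarity scores

The Usage snippet prints raw cosine similarities between one image and each caption. For zero-shot classification it is common to scale those scores and apply a softmax so they can be read as a distribution over the candidate labels. The sketch below is one way to do that in plain Python. The helper name `rank_labels`, the default `logit_scale=100.0`, and the sample similarity values are assumptions made for illustration; they do not come from this repository, and the scale learned by this checkpoint may differ.

```python
import math


def rank_labels(similarities, labels, logit_scale=100.0):
    """Convert cosine similarities to softmax probabilities, sorted high to low.

    `logit_scale` is an assumed temperature, not a value taken from this checkpoint.
    """
    if len(similarities) != len(labels):
        raise ValueError("similarities and labels must have the same length")

    logits = [logit_scale * s for s in similarities]

    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


labels = ["fundus photograph", "chest radiograph", "histopathology image"]

# Made-up similarity values for illustration only (not model output).
example_similarities = [0.21, 0.31, 0.18]

ranking = rank_labels(example_similarities, labels)
for label, prob in ranking:
    print(f"{label}: {prob:.4f}")
```

With the Usage code, the first argument would be `similarity[0].tolist()`. To use the temperature stored in the loaded model rather than the assumed default, `model.logit_scale.exp().item()` can be passed as `logit_scale`.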