---
library_name: open_clip
tags:
- clip
- openclip
- medical
- vision-language
- image-text-retrieval
- medpmc
---

# MedPMC-CLIP

MedPMC-CLIP is a medical vision-language model based on the OpenCLIP `ViT-L-14` architecture.
The model was trained on the [MedPMC-11M dataset](https://huggingface.co/datasets/Yale-BIDS-Chen/medpmc-11m-dataset_jun24_baseline), a carefully curated collection of approximately 11 million image-caption pairs derived from biomedical literature.
MedPMC-CLIP outperforms existing baseline models across a range of evaluations, including zero-shot medical image classification on 26 public benchmarks and zero-shot image retrieval on an internal clinical dermatology dataset.
For additional details on model training and benchmark results, please refer to our paper (coming soon).

This repository provides the checkpoint in **OpenCLIP format**. Text inputs should be tokenized using the default OpenCLIP tokenizer for `ViT-L-14`.

```python
import open_clip

tokenizer = open_clip.get_tokenizer("ViT-L-14")
```

## Files

- `open_clip_pytorch_model.safetensors`: OpenCLIP-format model checkpoint
- `inference_example.py`: example code for image-text similarity
- `requirements.txt`: minimal dependencies

## Usage

```python
import torch
import open_clip
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download
from PIL import Image

model_name = "ViT-L-14"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Build the architecture only; the weights are loaded from this repository below.
model, _, preprocess = open_clip.create_model_and_transforms(
    model_name,
    pretrained=None,
)

repo_id = "Yale-BIDS-Chen/medpmc-clip-l-14_jun24_v1"

# Download the checkpoint from the Hugging Face Hub (cached after the first call).
ckpt_path = hf_hub_download(
    repo_id=repo_id,
    filename="open_clip_pytorch_model.safetensors",
)

# Load on CPU first, then move the model to the target device.
state_dict = load_file(ckpt_path, device="cpu")
model.load_state_dict(state_dict, strict=True)
model = model.to(device)
model.eval()

tokenizer = open_clip.get_tokenizer(model_name)

# One image (batch of 1) and three candidate text prompts.
image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0).to(device)
text = tokenizer(["fundus photograph", "chest radiograph", "histopathology image"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # L2-normalize so the dot product below is a cosine similarity.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # Shape (1, 3): one row per image, one column per text prompt.
    similarity = image_features @ text_features.T

print(similarity)
```
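The `similarity` tensor printed above holds raw cosine similarities. For zero-shot classification, a common CLIP-style step is to scale these values and apply a softmax over the text prompts; for retrieval, the same scores can be sorted to rank a gallery of images against one text query. The snippet below is a minimal sketch of both, not the evaluation protocol behind the reported results. `zero_shot_probs` and `rank_images` are hypothetical helper names, the scale of `100.0` is an assumed value (in OpenCLIP the learned scale should be available as `model.logit_scale.exp()`), and random tensors stand in for real features so that the example runs without downloading the checkpoint.

```python
import torch


def zero_shot_probs(image_features, text_features, logit_scale=100.0):
    """Per-image probabilities over the text prompts (hypothetical helper).

    Both inputs are assumed to be L2-normalized, with shapes (N, D) and (C, D).
    logit_scale=100.0 is an assumed value, not one taken from this checkpoint.
    """
    logits = logit_scale * image_features @ text_features.T  # (N, C)
    return logits.softmax(dim=-1)


def rank_images(text_feature, image_features, k=5):
    """Indices of the k gallery images most similar to one text query."""
    scores = image_features @ text_feature  # (N,)
    k = min(k, scores.numel())
    return scores.topk(k).indices


# Stand-in features: 4 "images" and 3 "prompts" in an 8-dimensional space.
torch.manual_seed(0)
labels = ["fundus photograph", "chest radiograph", "histopathology image"]
img = torch.nn.functional.normalize(torch.randn(4, 8), dim=-1)
txt = torch.nn.functional.normalize(torch.randn(len(labels), 8), dim=-1)

probs = zero_shot_probs(img, txt)
predicted = [labels[i] for i in probs.argmax(dim=-1).tolist()]
top2 = rank_images(txt[0], img, k=2)

print(probs)      # each row sums to 1
print(predicted)  # one label per image
print(top2)       # gallery indices ranked for the first prompt
```

With the real model, `image_features` and `text_features` from the usage example can be passed in place of `img` and `txt`.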

## Citation

Citation information will be added upon release.

## Questions?

For questions or feedback, please contact Hyunjae Kim at `hyunjae.kim@yale.edu`.