MedPMC-CLIP

MedPMC-CLIP is a medical vision-language model based on the OpenCLIP ViT-L-14 architecture. It was trained on MedPMC-11M, a curated dataset of approximately 11 million image-caption pairs derived from biomedical literature. In our evaluations, which include zero-shot medical image classification on 26 public benchmarks and zero-shot image retrieval on an internal clinical dermatology dataset, MedPMC-CLIP consistently outperforms existing baseline models. For details on model training and benchmark results, please refer to our paper (coming soon).

This repository provides the checkpoint in OpenCLIP format. Text inputs should be tokenized using the default OpenCLIP tokenizer for ViT-L-14.

tokenizer = open_clip.get_tokenizer("ViT-L-14")

Files

open_clip_pytorch_model.safetensors: OpenCLIP-format model checkpoint
inference_example.py: example code for image-text similarity
requirements.txt: minimal dependencies

Usage

import torch
import open_clip
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download
from PIL import Image

model_name = "ViT-L-14"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Build the architecture only; the weights come from the checkpoint loaded below
model, _, preprocess = open_clip.create_model_and_transforms(
    model_name,
    pretrained=None,
)

repo_id = "Yale-BIDS-Chen/medpmc-clip-l-14_jun24_v1"

ckpt_path = hf_hub_download(
    repo_id=repo_id,
    filename="open_clip_pytorch_model.safetensors",
)

# strict=True raises an error if the checkpoint keys do not match the model exactly
state_dict = load_file(ckpt_path, device="cpu")
model.load_state_dict(state_dict, strict=True)
model = model.to(device)
model.eval()

tokenizer = open_clip.get_tokenizer(model_name)

image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0).to(device)
text = tokenizer(["fundus photograph", "chest radiograph", "histopathology image"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

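    # L2-normalize so that the dot product is a cosine similarity
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # One row per image, one column per text prompt: shape (1, 3) here
    similarity = image_features @ text_features.T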

print(similarity)
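
The printed values are raw cosine similarities. For zero-shot classification, a common CLIP-style follow-up is to scale them and apply a softmax over the text prompts. The sketch below shows that step in plain Python so it runs without the model. The helper name zero_shot_probs, the example scores, and the default logit_scale of 100.0 are illustrative assumptions, not part of this repository. In practice the scale would more likely be read from the loaded model (for example model.logit_scale.exp().item()), and the scores would come from similarity[0].tolist().

```python
import math


def zero_shot_probs(similarities, labels, logit_scale=100.0):
    """Convert one image's cosine similarities into label probabilities.

    logit_scale=100.0 is an assumed placeholder; prefer the value stored
    in the loaded checkpoint when it is available.
    """
    if len(similarities) != len(labels):
        raise ValueError("need exactly one similarity score per label")

    # Scale the similarities, then apply a numerically stable softmax
    logits = [logit_scale * s for s in similarities]
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = {label: e / total for label, e in zip(labels, exps)}

    # Return labels ordered from most to least likely
    return dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True))


labels = ["fundus photograph", "chest radiograph", "histopathology image"]
scores = [0.21, 0.31, 0.18]  # made-up similarities, for illustration only
ranked = zero_shot_probs(scores, labels)
print(next(iter(ranked)))  # label with the highest probability
```

Because the scale is large, small gaps in cosine similarity turn into sharply peaked probabilities, so the ranking is usually more informative than the absolute values.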

Citation

Citation information will be added upon release.

Questions?

For questions or feedback, please contact Hyunjae Kim at hyunjae.kim@yale.edu.
