CLIP Needs Registers. And Gated MLPs. And +20M params.

Fixing CLIP's modality gap via happy little accidents.

Love ❤️ this CLIP?

ᐅ Buy me a coffee on Ko-Fi ☕

Or send 🪙₿ BTC to this address:
3PscBrWYvrutXedLmvpcnQbE12Py8qLqMK

ℹ️ Update 02/June/2025:

  • You can now load the model with HF 'transformers'. ✅
  • Unfortunately, AutoModel produced nonsense, and I couldn't get "trust_remote_code=True" to work properly (using that was suggested in response to my pull request on GitHub).

💡 Alas, you will need to:

  • Download the 'hfmodel' folder
  • Use it to manually import the correct (my custom) CLIPModel code, as required by the config.json
  • Minimal example code:
import torch
import torch.nn.functional as F
from PIL import Image, ImageDraw
from transformers import CLIPProcessor

# Custom CLIPModel class from the downloaded 'hfmodel' folder (not the stock transformers one)
from hfmodel.modeling_clip import CLIPModel

model = CLIPModel.from_pretrained("zer0int/CLIP-Registers-Gated_MLP-ViT-L-14")
processor = CLIPProcessor.from_pretrained("zer0int/CLIP-Registers-Gated_MLP-ViT-L-14")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

size = 224
im = Image.new("RGB", (size, size), (255, 255, 255))
draw = ImageDraw.Draw(im)

# --------- GPT-4.1's idea of a pineapple. Need an input image... ---------
body_bbox = [size*0.28, size*0.38, size*0.72, size*0.90]
draw.ellipse(body_bbox, fill=(254, 221, 72), outline=(180, 120, 0), width=5)
eye_color = (198, 134, 66)
for row in range(4):
    for col in range(3):
        ex = size*0.36 + col*size*0.09 + (row%2)*size*0.045
        ey = size*0.50 + row*size*0.085
        ew, eh = size*0.035, size*0.025
        draw.ellipse([ex-ew, ey-eh, ex+ew, ey+eh], fill=eye_color, outline=None)
leaf_color = (61, 179, 70)
leaf_base_x = size/2
leaf_base_y = size*0.38
from math import radians, cos, sin
for angle, length in [(-28, 65), (-12, 70), (0, 80), (12, 70), (28, 65)]:
    a = radians(angle)
    tip_x = leaf_base_x + length*sin(a)
    tip_y = leaf_base_y - length*cos(a)
    left = (leaf_base_x + 13*cos(a+1.5), leaf_base_y + 13*sin(a+1.5))
    right = (leaf_base_x + 13*cos(a-1.5), leaf_base_y + 13*sin(a-1.5))
    draw.polygon([left, (tip_x, tip_y), right], fill=leaf_color)
im.save("pineapple.png")
# ---------

image = Image.open("pineapple.png").convert("RGB")
texts = ["pine", "apple", "pineapple", "orange", "pear", "person", "cat", "dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True).to(device)

with torch.no_grad():
    outputs = model(**inputs)
    image_embeds = outputs.image_embeds
    text_embeds = outputs.text_embeds

image_embeds = F.normalize(image_embeds, dim=-1)
text_embeds = F.normalize(text_embeds, dim=-1)
cos_sim = image_embeds @ text_embeds.T
cos_sim = cos_sim.squeeze(0)

for text, sim in zip(texts, cos_sim):
    print(f"Similarity with '{text}': {sim.item():.4f}")
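The raw cosine similarities printed above tend to sit close together, so it can help to turn them into zero-shot probabilities. Below is a minimal pure-Python sketch of that step, kept free of torch so it runs without downloading the model. The scale of 100 is an assumption (it is roughly the logit scale of the original OpenAI CLIP checkpoints, not a value confirmed for this fine-tune), and the similarity values are made up for illustration.

```python
import math

def zero_shot_probs(cos_sims, scale=100.0):
    """Softmax over scaled cosine similarities (numerically stable)."""
    logits = [scale * s for s in cos_sims]
    peak = max(logits)  # subtract the max before exponentiating to avoid overflow
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up similarities for illustration only; NOT output of the model.
labels = ["pine", "apple", "pineapple", "orange"]
example_sims = [0.21, 0.24, 0.31, 0.19]

probs = zero_shot_probs(example_sims)
for label, p in sorted(zip(labels, probs), key=lambda pair: -pair[1]):
    print(f"{label:>10}: {p:.4f}")
```

With the tensors from the example above, the torch equivalent would be `(100.0 * cos_sim).softmax(dim=-1)`, under the same assumption about the scale.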


I just want a new Text Encoder...

  • ...for my Text-to-Image (Text-to-Video) AI! \o/
  • I recommend this one, the 'sweet spot' ckpt12: 👉 direct download 👈
  • Even lower modality gap (text 'more alike' to image, but less accurate): direct download
  • Enjoy! (You don't need to do anything else, they're just normal CLIP Text Encoders!)

⚠️ Full model (currently) not HuggingFace Transformers compatible. ⚠️

  • The ViT (Vision Encoder) is basically a big mutant. Alas:
  • The full model .safetensors have the 'import clip' (OpenAI) structure inside.
  • It's just so you don't need to load any 'danger pickles'. :)
  • Alas, currently it runs with 'import clip' code (I'm working on a HF implementation, though!).
  • However, for now, I made an entire playground for the CLIP models (+ safetensors loading)! 🎉:
  • 🌟 https://github.com/zer0int/CLIP-fine-tune-registers-gated ✨
  • All code for fine-tuning it yourself is also included on my Git! 🤗

Wait, but what is this?!

  • The Vision Transformer has +4 tokens (Register Tokens).
  • ...And gated ReLU MLPs inside each layer + final Fusion MLP.
  • +20M parameters (~430M -> now: ~450M)
  • It's now a CLIP with an extremely low modality gap.
  • See the table below for details.
  • And if you want to know more about modality gaps and all the details, please check out the GitHub!
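To make the first two bullets concrete, here is a toy NumPy sketch of what "+4 register tokens" and a "gated ReLU MLP" mean in general. Assuming the standard 224x224 input with 14x14 patches, the ViT sees 16x16 = 256 patch tokens plus one class token, so four registers take the sequence from 257 to 261 tokens. The placement of the registers (appended at the end here), the exact gating formula, and the tiny layer sizes are all assumptions for illustration; this is not the model's actual code, which lives in the GitHub repo.

```python
import numpy as np

PATCH, IMAGE, NUM_REGISTERS = 14, 224, 4

def token_count(image=IMAGE, patch=PATCH, registers=0):
    """Patch tokens + 1 class token + optional register tokens."""
    per_side = image // patch
    return per_side * per_side + 1 + registers

def add_registers(tokens, registers):
    """Append register tokens to a (seq, dim) token matrix.

    Assumption: registers go at the end of the sequence; the real
    model may place them elsewhere.
    """
    return np.concatenate([tokens, registers], axis=0)

def gated_relu_mlp(x, w_gate, w_up, w_down):
    """Generic gated MLP: (relu(x @ w_gate) * (x @ w_up)) @ w_down."""
    gate = np.maximum(x @ w_gate, 0.0)  # ReLU decides how much of each unit passes
    return (gate * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
dim, hidden = 8, 16  # toy sizes, not the real ViT-L width of 1024
w_gate = rng.normal(size=(dim, hidden))
w_up = rng.normal(size=(dim, hidden))
w_down = rng.normal(size=(hidden, dim))

tokens = rng.normal(size=(token_count(), dim))      # 257 tokens
registers = rng.normal(size=(NUM_REGISTERS, dim))   # 4 extra tokens
x = add_registers(tokens, registers)
y = gated_relu_mlp(x, w_gate, w_up, w_down)
print(x.shape, y.shape)
```

The gate multiplies the "up" projection elementwise, so a unit whose gate is zero contributes nothing to the output, whatever the other branch computes.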

An image is worth 16x16 words, alas:

Attention Heatmap, pre-trained OpenAI CLIP ViT-L/14: [image]

This model, CLIP REG-XGATED: [image]

Text-To-Image examples, Flux.1-dev, pure CLIP (no T5) guidance:

[two example images]

Model Performance Overview

| Task / Dataset | Metric | ViT-L/14 OpenAI (Pre-trained) | X-GATED (ckpt20 xtreme) | X-GATED (ckpt12 balanced) | X-GATED (ckpt12 balanced, ablated) |
|---|---|---|---|---|---|
| VoC-2007 (Multilabel) | mAP | 0.7615 | 0.8140 | **0.8471** | 0.8247 |
| MSCOCO Retrieval | Image Recall@5 | 0.2194 | **0.3565** | 0.3532 | 0.3349 |
| | Text Recall@5 | 0.3034 | **0.5425** | 0.5278 | 0.5086 |
| Linear Probe CIFAR-10 | Acc@1 | 0.9535 | **0.9813** | **0.9813** | 0.9811 |
| | Acc@5 | 0.9966 | **0.9997** | **0.9997** | **0.9997** |
| | Mean Class Recall | 0.9535 | **0.9813** | **0.9813** | 0.9811 |
| MVT ImageNet/ObjectNet (Zero-Shot) | Accuracy | 0.8453 | 0.8686 | **0.8830** | 0.8815 |
| Linear Probe ILSVRC2012 | Top-1 | **69.86%** | 66.43% | 67.10% | 68.99% |
| | Top-5 | **92.70%** | 91.52% | 91.83% | 92.64% |
| Modality Gap Metrics | Euclidean Gap ↓ | 0.8276 | **0.4740** | 0.5395 | 0.7486 |
| | JSD ↓ | 0.5200 | 0.1601 | **0.1303** | 0.3310 |
| | Wasserstein Distance ↓ | 0.4084 | **0.1742** | 0.2102 | 0.3262 |
| | Img-Text Cos Sim (mean) ↑ | 0.2723 | **0.4926** | 0.4794 | 0.3634 |
| | Img-Text Cos Sim (std) | 0.0362 | 0.0814 | 0.0758 | 0.0537 |
| | Text-Text Cos Sim (mean) | 0.6807 | 0.6657 | 0.6896 | 0.6896 |
| | Text-Text Cos Sim (std) | 0.1344 | 0.1671 | 0.1535 | 0.1535 |

Bolded values represent the best performance for each metric.
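For readers unfamiliar with the modality gap rows, here is a minimal NumPy sketch of one common way to compute two of them: the Euclidean distance between the centroids of the normalized image and text embeddings, and the mean/std cosine similarity of matching image-text pairs. This definition is an assumption; the numbers in the table may have been computed differently (the GitHub repo has the actual evaluation code). The function name `modality_gap_stats` and the synthetic data are hypothetical, and the printed values have nothing to do with the table.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def modality_gap_stats(image_embeds, text_embeds):
    """Return (euclidean_gap, cos_mean, cos_std) for paired embeddings.

    Rows are assumed to be matching image-text pairs.
    """
    img = l2_normalize(np.asarray(image_embeds, dtype=np.float64))
    txt = l2_normalize(np.asarray(text_embeds, dtype=np.float64))
    # Distance between the two modality centroids on the unit sphere.
    euclidean_gap = np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0))
    # Cosine similarity of each matching pair (rows are unit length).
    paired_cos = np.sum(img * txt, axis=-1)
    return float(euclidean_gap), float(paired_cos.mean()), float(paired_cos.std())

# Synthetic stand-ins for real embeddings: a shared signal plus a
# modality-specific offset that pushes the two clusters apart.
rng = np.random.default_rng(0)
shared = rng.normal(size=(64, 16))
offset = np.zeros(16)
offset[0] = 3.0

gap_far = modality_gap_stats(shared + offset, shared - offset)
gap_none = modality_gap_stats(shared, shared)
print("separated clusters:", gap_far)
print("identical clusters:", gap_none)
```

In this toy setup, identical clusters give a gap of zero and a pair similarity of one, while the offset clusters give a larger gap and a lower similarity, which is the direction the ↓ and ↑ arrows in the table refer to.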
