Mastering the Huggingface CLIP Model: How to Extract Embeddings and Calculate Similarity for Text and Images
Huggingface's transformers library is a great resource for natural language processing tasks, and it includes an implementation of OpenAI's CLIP model, along with pretrained checkpoints such as clip-vit-large-patch14. CLIP is a powerful image and text embedding model that can be used for a wide range of tasks, such as matching images to captions and similarity search.
The CLIPModel documentation provides examples of how to use the model to calculate the similarity of images and captions, but it is less clear on how to obtain the raw embeddings of the input data. While the documentation provides some guidance on how to use the model's embedding layer, it is not always clear how to extract the embeddings for further analysis or use in other tasks.
Furthermore, the documentation does not cover how to calculate similarity between text and image embeddings yourself. This can be useful for tasks such as image-text matching or precalculating image embeddings for later (or repeated) use.
In this post, we will show how to obtain the raw embeddings from the CLIPModel and how to calculate similarity between them using PyTorch. With this information, you will be able to use the CLIPModel in a more flexible way and adapt it to your specific needs.
Benchmark example: Logit similarity score between text and image embeddings
Here's the example from CLIPModel documentation we'd ideally like to split into text and image embeddings and then calculate the similarity score between them ourselves:
from PIL import Image
import requests
from transformers import AutoProcessor, CLIPModel

# Load the model and its processor from the same checkpoint
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=image,
    return_tensors="pt",
    padding=True,
)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
If you run the code and print(logits_per_image), you should get:
tensor([[18.9041, 11.7159]], grad_fn=<PermuteBackward0>)
The code that calculates the logits can be found in the forward() function source.
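The softmax step at the end of the example can also be reproduced by hand from the two printed logits. Here is a minimal sketch in plain Python, assuming the logits are exactly the values shown above (the variable names are just for illustration):

```python
import math

# Logits printed by the example above, in label order: [cat, dog]
logits = [18.9041, 11.7159]
labels = ["a photo of a cat", "a photo of a dog"]

# Numerically stable softmax: subtract the max before exponentiating
largest = max(logits)
exps = [math.exp(x - largest) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]

for label, p in zip(labels, probs):
    print(f"{label}: {p:.4f}")
```

A gap of about 7 between the logits is enough to put well over 99% of the probability on the cat caption.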
Acquiring image and text features separately
from PIL import Image
import requests
from transformers import AutoProcessor, AutoTokenizer, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")

# Get the text features
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")
inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
print(text_features.shape)  # output shape of text features

# Get the image features
processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
print(image_features.shape)  # output shape of image features
Running this should yield the following output:
$ python cliptest.py
torch.Size([2, 768])
torch.Size([1, 768])
Looks pretty good! Two 768-element vectors for the two labels, and one of the same size for the image! Now let's see if we can calculate the similarity between the two...
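One aside before moving on: the tensors printed in this post carry a grad_fn, because the model is called with gradient tracking on. If the embeddings are only going to be stored or compared, wrapping the call in torch.no_grad() avoids recording that graph. The sketch below uses a small nn.Linear as a hypothetical stand-in for the CLIP encoder so that it runs without downloading any weights; with the real model, the call inside the with block would be model.get_image_features(**inputs).

```python
import torch
from torch import nn

# Hypothetical stand-in for a CLIP encoder (NOT the real model), used only
# so this sketch runs without downloading weights.
encoder = nn.Linear(16, 768)
batch = torch.randn(1, 16)

# Default call: the output is attached to the autograd graph
tracked = encoder(batch)

# Inference-only call: no graph is recorded
with torch.no_grad():
    untracked = encoder(batch)

print(tracked.requires_grad, untracked.requires_grad)  # True False
```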
Calculating image and text cosine similarity
CLIP uses "cosine similarity", which is essentially a dot product of the image and text feature vectors. We can just transpose one of the tensors and multiply the two together with torch.matmul:
>>> import torch
>>> torch.matmul(text_features, image_features.t())
tensor([[64.6993],
        [40.6225]], grad_fn=<MmBackward0>)
Ouch, not quite the 18.9041 and 11.7159 we were after. But hey, we forgot to scale the features! Let's divide each feature vector by its L2 norm (i.e. its Euclidean length) first, so that they become unit vectors:
>>> # normalized features
>>> image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)
>>> text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)
>>> torch.matmul(text_features, image_features.t())
tensor([[0.1890],
        [0.1172]], grad_fn=<MmBackward0>)
Better! As you can see, the logits from the documentation example are more or less exactly 100x these values. This can't be a coincidence, can it?
No, it is not. If you look at the forward() function source, you'll see that CLIPModel scales the resulting cosine similarities into "logits" (a term I'm not fully familiar with, but they should represent the relative, unnormalized scores of the different captions, which softmax then turns into probabilities). Here's what happens in the library (in different parts of the class):
# CLIPModel __init__()
self.logit_scale = nn.Parameter(torch.ones([]) * self.config.logit_scale_init_value)

# CLIPModel forward()
logit_scale = self.logit_scale.exp()
logits_per_text = torch.matmul(text_embeds, image_embeds.t()) * logit_scale
The CLIPConfig docs say logit_scale_init_value = 2.6592, and as you can see, the code exponentiates it before use. But e^2.6592 is roughly 14.3, not 100. BUT: we are actually using a pretrained model! Let's see what the value is:
>>> model.logit_scale
Parameter containing:
tensor(4.6052, requires_grad=True)
>>> model.logit_scale.exp()
tensor(100.0000, grad_fn=<ExpBackward0>)
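For what it's worth, both numbers in play here are easy to check with the standard library alone: the documented initial value and the value stored in the pretrained model. The variable names below are just for illustration.

```python
import math

init_value = 2.6592  # logit_scale_init_value from the CLIPConfig docs

# The scale that the initial value would give after exp()
scale_from_init = math.exp(init_value)
print(scale_from_init)  # about 14.28, i.e. roughly 1 / 0.07

# The logit_scale that corresponds to a scale of exactly 100
log_of_100 = math.log(100.0)
print(log_of_100)  # about 4.6052, matching the pretrained parameter
```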
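Bingo! It seems that in clip-vit-large-patch14 the value did not stay at its documented initial default. That makes sense: logit_scale is a trainable parameter, so 2.6592 is only where training starts, and here it has ended up at log(100), more or less. (The CLIP paper mentions clipping this scale so that the logits are never multiplied by more than 100, which would explain the round number.) So here is how you calculate "CLIP compatible" logits from the separately extracted image and text features: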
# normalized features
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)

logit_scale = model.logit_scale.exp()
torch.matmul(text_features, image_features.t()) * logit_scale
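This will yield the same values that we initially got from the "official example", only transposed: the result here has one row per caption (like logits_per_text), whereas logits_per_image has one row per image.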
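The recipe is short enough to wrap in a function. The sketch below is one way to do that; clip_style_logits is a hypothetical helper name, not part of transformers, and random tensors stand in for the real CLIP features so that the block runs without downloading the model. With the real thing, you would pass in the outputs of get_text_features and get_image_features, plus model.logit_scale.exp().

```python
import torch

def clip_style_logits(text_features, image_features, logit_scale):
    """Scaled cosine similarity, shape [num_texts, num_images] (hypothetical helper)."""
    text_unit = text_features / text_features.norm(p=2, dim=-1, keepdim=True)
    image_unit = image_features / image_features.norm(p=2, dim=-1, keepdim=True)
    return torch.matmul(text_unit, image_unit.t()) * logit_scale

# Placeholder inputs (assumption): random tensors with the shapes seen above
torch.manual_seed(0)
text_features = torch.randn(2, 768)
image_features = torch.randn(1, 768)
logit_scale = torch.tensor(100.0)  # the exp() of the pretrained logit_scale

logits_per_text = clip_style_logits(text_features, image_features, logit_scale)
logits_per_image = logits_per_text.t()   # one row per image
probs = logits_per_image.softmax(dim=1)  # label probabilities per image

print(logits_per_text.shape, probs.shape)
```

Transposing at the end gives the same layout as logits_per_image in the documentation example, so the softmax over dim=1 still runs across the captions.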
Note that if you don't care about getting the exact same values, you can skip the logit scaling completely. Just remember to normalize the image and text features before the dot product. Oh, and you can also use torch.nn.functional.cosine_similarity if you like that better than matmul (with or without the logit scale). It normalizes its inputs itself, so it works on the raw features too:
similarity = torch.nn.functional.cosine_similarity(text_features, image_features) * logit_scale
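To convince yourself that the two routes agree, here is a small check. It is only a sketch: random tensors stand in for the CLIP features (an assumption, so that nothing has to be downloaded), and the variable names are made up for the example.

```python
import torch
import torch.nn.functional as F

# Placeholder features (assumption): same shapes as the CLIP outputs above
torch.manual_seed(0)
text_features = torch.randn(2, 768)
image_features = torch.randn(1, 768)

# Route 1: normalize by hand, then matmul -> shape [2, 1]
text_unit = text_features / text_features.norm(p=2, dim=-1, keepdim=True)
image_unit = image_features / image_features.norm(p=2, dim=-1, keepdim=True)
via_matmul = torch.matmul(text_unit, image_unit.t())

# Route 2: cosine_similarity normalizes internally and broadcasts the
# single image row against both text rows -> shape [2]
via_cosine = F.cosine_similarity(text_features, image_features, dim=-1)

print(via_matmul.shape, via_cosine.shape)
print(torch.allclose(via_matmul.squeeze(1), via_cosine, atol=1e-5))
```

One design note: with a single image, broadcasting does the right thing. With several images and several captions, cosine_similarity compares rows pairwise rather than all-against-all, so the matmul route is the simpler way to get the full similarity matrix.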
That was fun! So what could you do with this? One idea is to build your own image search, like in this Medium article. It was the original inspiration for my journey, as I wanted to use HuggingFace CLIP implementation and the new large model instead of the one used in the article. :)
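As a parting sketch, here is roughly what the search step of such an image search could look like once the image embeddings have been precalculated. Everything here is an assumption for illustration: the random tensors are placeholders for real CLIP embeddings, and the index size and names are made up. In practice, image_index would be built once from get_image_features and each query would come from get_text_features.

```python
import torch

torch.manual_seed(0)

def normalize(features):
    # Scale each row to unit length so a dot product equals cosine similarity
    return features / features.norm(p=2, dim=-1, keepdim=True)

# Offline step: placeholder for the stored embeddings of 1000 images
image_index = normalize(torch.randn(1000, 768))

# Query step: placeholder for the embedding of one text query
query = normalize(torch.randn(1, 768))

# Cosine similarity of the query against every stored image -> shape [1000]
scores = torch.matmul(query, image_index.t()).squeeze(0)

# Indices of the five best-matching images, best first
top_scores, top_ids = torch.topk(scores, k=5)
print(top_ids.tolist())
```

Since only the ranking matters here, the logit scale is left out, as noted above.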