Mastering the Huggingface CLIP Model: How to Extract Embeddings and Calculate Similarity for Text and Images

[Article image: a neural network transforming images and text into vector data]

Huggingface's transformers library is a great resource for natural language processing tasks, and it includes an implementation of OpenAI's CLIP model, along with the pretrained checkpoint clip-vit-large-patch14. The CLIP model is a powerful image and text embedding model that can be used for a wide range of tasks, such as image captioning and similarity search.

The CLIPModel documentation provides examples of how to use the model to calculate the similarity of images and captions, but it is less clear on how to obtain the raw embeddings of the input data. While the documentation provides some guidance on how to use the model's embedding layer, it is not always clear how to extract the embeddings for further analysis or use in other tasks.

Furthermore, the documentation does not cover how to calculate similarity between text and image embeddings yourself. This can be useful for tasks such as image-text matching or precalculating image embeddings for later (or repeated) use.

In this post, we will show how to obtain the raw embeddings from the CLIPModel and how to calculate similarity between them using PyTorch. With this information, you will be able to use the CLIPModel in a more flexible way and adapt it to your specific needs.

Benchmark example: Logit similarity score between text and image embeddings

Here's the example from the CLIPModel documentation that we'd ideally like to split into text and image embeddings, so we can calculate the similarity score between them ourselves:

from PIL import Image
import requests
from transformers import AutoProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(
    text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True
)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities

If you run the code and print(logits_per_image), you should get:

tensor([[18.9041, 11.7159]], grad_fn=<PermuteBackward0>)
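
For completeness, the last line of the example turns these logits into label probabilities with a softmax. As a quick sanity check (a minimal sketch using only the logit values printed above), you can reproduce that step by hand:

import torch

# Logits from the example above: one image scored against
# ["a photo of a cat", "a photo of a dog"]
logits = torch.tensor([[18.9041, 11.7159]])

# Softmax over the caption dimension gives the label probabilities;
# the cat caption wins with roughly 99.9% probability
print(logits.softmax(dim=1))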

The code calculating the logits can be found in the forward() function source.

Acquiring image and text features separately

There are promising-looking examples in get_text_features() and get_image_features() that we can use to get CLIP features for either modality in tensor form:

from PIL import Image
import requests
from transformers import AutoProcessor, AutoTokenizer, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")

# Get the text features
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")

inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)

print(text_features.shape) # output shape of text features

# Get the image features
processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")

image_features = model.get_image_features(**inputs)

print(image_features.shape) # output shape of image features

Running this should yield the following output:

$ python cliptest.py
torch.Size([2, 768])
torch.Size([1, 768])

Looks pretty good! Two 768-item tensors for the two labels, and one similarly sized tensor for the image! Now let's see if we can calculate the similarity between the two...

Calculating image and text cosine similarity

CLIP uses "cosine similarity", which is essentially a dot product of the image and text feature vectors. We can just transpose one of the tensors and multiply them together with torch:

>>> torch.matmul(text_features, image_features.t())
tensor([[64.6993],
        [40.6225]], grad_fn=<MmBackward0>)

Ouch, not quite the 18.9041 and 11.7159 we were after. But hey, we forgot to scale the features! Let's calculate the L2 norm (i.e. the Euclidean norm) and divide the features by it first, so they become unit vectors:

>>> # normalized features
>>> image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)
>>> text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)
>>> torch.matmul(text_features, image_features.t())
tensor([[0.1890],
        [0.1172]], grad_fn=<MmBackward0>)

Better! As you can see, the reference logits are more or less exactly 100x our values. That can't be a coincidence, can it?

No, it is not. If you look at the forward() function source, you'll see that CLIPModel scales the cosine similarities into "logits" (which I'm not fully familiar with, but they should represent the relative probability scales of the different captions). Here's what happens in the library (in different parts of the class):

# CLIPModel __init__()
self.logit_scale = nn.Parameter(torch.ones([]) * self.config.logit_scale_init_value)

# CLIPModel forward()
logit_scale = self.logit_scale.exp()
logits_per_text = torch.matmul(text_embeds, image_embeds.t()) * logit_scale

Now the CLIPConfig docs say logit_scale_init_value = 2.6592, but as you can see, the code calculates exp() of that value, and e^2.6592 ≈ 14.3, which is not 100. BUT: we are using a pretrained model! Let's see what the value actually is:

>>> model.logit_scale
Parameter containing:
tensor(4.6052, requires_grad=True)
>>> model.logit_scale.exp()
tensor(100.0000, grad_fn=<ExpBackward0>)

Bingo! It seems the pretrained clip-vit-large-patch14 model has ended up with a logit_scale of more or less log(100) instead of the initial value. So here is how you calculate "CLIP compatible" logits from image_features and text_features:

# normalized features
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)
logit_scale = model.logit_scale.exp()
torch.matmul(text_features, image_features.t()) * logit_scale

This will yield the exact values that we initially got from the "official example" (transposed, since this matmul produces logits_per_text rather than logits_per_image).

Note that if you don't care about getting the same exact values, you can skip the logit scaling completely. Just remember to normalize the image and text features before the dot product. Oh, and you can also use torch.nn.functional.cosine_similarity if you like that better than matmul (with or without logit_scale):

similarity = torch.nn.functional.cosine_similarity(text_features, image_features) * logit_scale
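
To tie everything together, here is a self-contained sketch that recomputes the official example's logits from separately acquired features: it gets the text and image features, normalizes them, and applies the logit scale both via matmul and via cosine_similarity. This is just the snippets above combined into one script, using the same model and test image as the rest of this post:

import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")
processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

with torch.no_grad():
    text_features = model.get_text_features(
        **tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
    )
    image_features = model.get_image_features(**processor(images=image, return_tensors="pt"))

# Normalize to unit length so the dot product equals cosine similarity
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)

logit_scale = model.logit_scale.exp()

# Both of these should print roughly 18.90 and 11.72 (cat and dog)
print(torch.matmul(text_features, image_features.t()) * logit_scale)
print(torch.nn.functional.cosine_similarity(text_features, image_features) * logit_scale)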

Conclusion

That was fun! So what could you do with this? One idea is to build your own image search, like in this Medium article. It was the original inspiration for my journey, as I wanted to use the HuggingFace CLIP implementation and the newer large model instead of the one used in the article. :)
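
If you want to try the image search idea, here is a rough sketch of how it could look with precomputed embeddings. Note that the image paths below are made-up placeholders, and the ranking loop is my own illustration rather than something from the Medium article:

import torch
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")
processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Hypothetical local images -- replace with your own paths
paths = ["cat.jpg", "dog.jpg", "car.jpg"]

with torch.no_grad():
    # Precompute and normalize the image embeddings once;
    # they could be saved with torch.save() for repeated use
    images = [Image.open(p) for p in paths]
    image_features = model.get_image_features(**processor(images=images, return_tensors="pt"))
    image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)

    # Embed the query text and rank the images by cosine similarity
    query = tokenizer(["a photo of a cat"], padding=True, return_tensors="pt")
    text_features = model.get_text_features(**query)
    text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)

    scores = (image_features @ text_features.t()).squeeze(1)  # one score per image
    for path, score in sorted(zip(paths, scores.tolist()), key=lambda x: -x[1]):
        print(f"{score:.3f}  {path}")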