Mastering the Huggingface CLIP Model: How to Extract Embeddings and Calculate Similarity for Text and Images

Article image, neural network transforming images and text into vector data

Huggingface's transformers library is a great resource for natural language processing tasks, and it includes an implementation of OpenAI's CLIP model, along with pretrained weights such as clip-vit-large-patch14. CLIP is a powerful image and text embedding model that can be used for a wide range of tasks, such as image captioning and similarity search.

The CLIPModel documentation provides examples of how to use the model to calculate the similarity of images and captions, but it is less clear on how to obtain the raw embeddings of the input data for further analysis or use in other tasks.

Furthermore, the documentation does not cover how to calculate similarity between text and image embeddings yourself. This can be useful for tasks such as image-text matching or precalculating image embeddings for later (or repeated) use.

In this post, we will show how to obtain the raw embeddings from the CLIPModel and how to calculate similarity between them using PyTorch. With this information, you will be able to use the CLIPModel in a more flexible way and adapt it to your specific needs.

Benchmark example: Logit similarity score between text and image embeddings

Here's the example from the CLIPModel documentation that we'd ideally like to split into text and image embeddings, and then calculate the similarity score between them ourselves:

from PIL import Image
import requests
from transformers import AutoProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(
    text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True
)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities

If you run the code and print(logits_per_image), you should get:

tensor([[18.9041, 11.7159]], grad_fn=<PermuteBackward0>)

The code that calculates the logits can be found in the forward() function source.

Acquiring image and text features separately

The documentation for get_text_features() and get_image_features() contains pretty promising looking examples that we can use to get the CLIP features of text and images separately, in tensor form:

from PIL import Image
import requests
from transformers import AutoProcessor, AutoTokenizer, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")

# Get the text features
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")

inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)

print(text_features.shape) # output shape of text features

# Get the image features
processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")

image_features = model.get_image_features(**inputs)

print(image_features.shape) # output shape of image features

Running this should yield the following output:

$ python cliptest.py
torch.Size([2, 768])
torch.Size([1, 768])

Looks pretty good! Two 768-dimensional vectors for the two labels, and one similarly sized for the image! Now let's see if we can calculate the similarity between the two...
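The calculation itself turns out not to be too complicated. Below is a minimal PyTorch sketch, continuing from the model, text_features and image_features variables defined above. Based on my reading of the forward() source, the embeddings are L2-normalized and their dot products scaled by the model's learned logit_scale; treat this as a sketch rather than a verbatim excerpt of the library code:

# L2-normalize the embeddings, since CLIP compares directions, not magnitudes
text_embeds = text_features / text_features.norm(p=2, dim=-1, keepdim=True)
image_embeds = image_features / image_features.norm(p=2, dim=-1, keepdim=True)

# scale the cosine similarities with the model's learned temperature
logit_scale = model.logit_scale.exp()
logits_per_image = logit_scale * image_embeds @ text_embeds.t()

print(logits_per_image)  # should be close to tensor([[18.9041, 11.7159]])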

Read post

Analyzing 433 MHz Nexa Smart Power Plug Remote Control Signal with Arduino Uno

A friend recently started a project to remotely boot his router (which tends to hang randomly) with a Raspberry Pi. Unfortunately, the rpi-rf tool was not quite recognizing the signals. I pitched in to help: he did not have access to an oscilloscope, but he did have an Arduino Uno, so I thought maybe I could figure it out with that.

Fast forward a few weeks, and I have been experimenting with four methods of analyzing my own Nexa 433 MHz remote controller:

  1. Arduino Uno
  2. Soundcard a.k.a. "poor man's oscilloscope"
  3. Raspberry Pi
  4. An actual oscilloscope, namely my Picoscope 2208B

Having learned a lot, I thought I'd document the process for others to learn from, or maybe even hijack to analyze their own smart remotes. In this first part, I will cover the process with the Arduino Uno, and the following posts will go through the other three methods.

Starting Simple: Arduino and 433 MHz receiver

Having purchased a rather basic Hope Microelectronics (RFM210LCF-433D) 3.3V receiver for 433 MHz signals, it was easy to wire it to the Arduino:

  1. Connect GND and 3.3V outputs from Arduino to GND and VCC
  2. Connect Arduino PIN 8 to DATA on the receiver
  3. Connect a fourth "enable" pin to GND as well to turn the receiver on

You can see the setup here:

Arduino wired to Hope 433 MHz rf receiver

I wrote a simple Arduino script that measures the PIN 8 voltage every 50 microseconds (20 kHz), recording the lengths of HIGH/LOW pulses in an unsigned short array. Due to the 2 kB memory limitation, there is only space for about 850 edges, and the maximum length of a single edge is about 65 000 samples, i.e. a bit more than three seconds.

Once the buffer is filled with edge data or the maximum "silence" is reached, the code prints out the data over serial, resets the buffer and starts again, blinking an LED for 5 seconds so you know when you should start pressing those remote control buttons. Or perhaps "press a button": a single key press on my Nexa pretty much fills the buffer, as it sends the same data of about 130 edges a minimum of 5 times, taking up almost 700 edges!

It also turned out that the "silence" limit is rarely reached, as the Hope receiver is pretty good at catching stray signals from other places when there is nothing transmitting nearby (it likely has automatic sensitivity that "turns up the volume" if it doesn't hear anything).

Read post

Creating arbitrarily hard PBKDF2 keys with Bitcoin inspired difficulty factor

In recent years, the processing power offered by graphics processing units (GPUs) has made brute force attacks on passwords dramatically faster, which has led to the adoption of methods like PBKDF2 (Password-Based Key Derivation Function 2) for secure password storage. PBKDF2 is a key derivation function that is designed to be computationally expensive in order to slow down dictionary attacks and other brute force attacks on passwords.

As processing power continues to advance, it has become necessary to increase the number of iterations used in PBKDF2 in order to maintain a high level of security. With more iterations, it becomes even more difficult for an attacker to crack a password using brute force methods.

Recently, I had an idea. What if it were possible to run PBKDF2 arbitrarily long and print out points that match certain criteria? This could potentially provide an even higher level of security for password storage, as the number of iterations could be increased to levels that would make brute force attacks infeasible. It's an idea worth exploring and I'm excited to see what the future holds for PBKDF2 and other password security measures.

Bitcoin difficulty

One of the key features of the Bitcoin network is its use of a difficulty factor to scale the hardness of block creation based on the number of computers that are currently mining. In other words, as more computers join the network and begin trying to solve the cryptographic puzzles required to add new blocks to the blockchain, the difficulty of these puzzles increases in order to maintain a consistent rate of block creation. This ensures that the network remains secure and resistant to attacks, even as the number of miners grows over time.

The basic idea behind this technique is fairly simple: by requiring that the block hash end in a certain number of zero bits, the complexity of the puzzle increases in powers of two. Every hash is essentially random, and modifying the hashed data by the tiniest bit results in a completely new hash. Every other hash ends in a zero bit, and every other in a one. With two zero bits, it's every 4th hash. To zero a full byte (8 bits) you already need 256 (2^8) tries on average, and with three bytes it's already close to 17 million.
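To sanity check those numbers, here's a small illustrative snippet of my own (nothing Bitcoin specific) that hashes a chain of values with SHA256 and counts how often the last bit and the last byte come out as zero:

import hashlib

# empirically check the "every other hash ends in zero" intuition
N = 100_000
zero_bit = zero_byte = 0
data = b'seed'
for _ in range(N):
    data = hashlib.sha256(data).digest()
    zero_bit += (data[-1] & 1) == 0  # last bit is zero
    zero_byte += data[-1] == 0       # whole last byte is zero

print(f'zero last bit:  {zero_bit / N:.3f} (expect ~0.5)')
print(f'zero last byte: {zero_byte / N:.4f} (expect ~{1 / 256:.4f})')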

Printing out PBKDF2 steps at deterministic points

Combining the two ideas is one way to deterministically create encryption keys of increasing difficulty:
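Roughly, the gist is to run an HMAC-SHA256 chain PBKDF2-style and, every time the current digest ends in N zero bits, print out the iteration count and digest as a difficulty-2^N checkpoint. Here's my own illustrative sketch of the idea, not the post's actual code:

import hashlib, hmac

def difficulty_checkpoints(pwd, salt, max_bits=16):
    # iterate an HMAC-SHA256 chain, reporting the iterations where the
    # digest ends in successively more zero bits (difficulty 2**bits)
    U = hmac.new(pwd, salt, hashlib.sha256).digest()
    i, bits = 0, 1
    while bits <= max_bits:
        i += 1
        U = hmac.new(pwd, U, hashlib.sha256).digest()
        if int.from_bytes(U, 'big') & ((1 << bits) - 1) == 0:
            print(f'{bits} zero bits at iteration {i}: {U.hex()}')
            bits += 1
    return U

difficulty_checkpoints(b'password', b'salt')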

Read post

How to calculate PBKDF2 HMAC SHA256 with Python, example code

Having just spent 4 hours trying to get a Python pseudocode version of PBKDF2 to match the hashlib.pbkdf2_hmac() output, I thought I'd post Yet Another Example of how to do it. I thought I could just use hashlib.sha256 to calculate the steps, but it turns out HMAC is not just a concatenation of password, salt and counter.

So, without further ado, here's 256-bit key generation with a password and salt:

import hashlib, hmac

def pbkdf2(pwd, salt, iter):
    h = hmac.new(pwd, digestmod=hashlib.sha256) # create HMAC using SHA256
    m = h.copy() # calculate PRF(Password, Salt+INT_32_BE(1))
    m.update(salt)
    m.update(b'\x00\x00\x00\x01')
    U = m.digest()
    T = bytes(U) # copy
    for _ in range(1, iter):
        m = h.copy() # new instance of hmac(key)
        m.update(U) # PRF(Password, U-1)
        U = m.digest()
        T = bytes(a^b for a,b in zip(U,T))
    return T

pwd = b'password'
salt = b'salt'

# both should print 120fb6cffcf8b32c43e7225256c4f837a86548c92ccc35480805987cb70be17b
print(pbkdf2(pwd, salt, 1).hex())
print(hashlib.pbkdf2_hmac('sha256', pwd, salt, 1).hex())

# both should print c5e478d59288c841aa530db6845c4c8d962893a001ce4e11a4963873aa98134a
print(pbkdf2(pwd, salt, 4096).hex())
print(hashlib.pbkdf2_hmac('sha256', pwd, salt, 4096).hex())

Getting from pseudocode to an actual working example was surprisingly hard, especially since most implementations on the web are in lower-level languages, and the Python results are mostly just using a library.

Simplifying the pseudocode further

If you want to avoid the new...update...digest dance and skip the hmac library altogether, the code becomes even simpler, since HMAC itself is straightforward to implement in Python. Here's a gethmac function hard-coded to SHA256, and an even shorter pbkdf2 (a sketch of the former follows below; the full versions are in the post):
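HMAC-SHA256 is just two nested hashes over an XOR-padded key, so a from-scratch gethmac can look roughly like this. This is my own reconstruction, not necessarily identical to the post's version:

import hashlib, hmac

def gethmac(key, msg):
    # HMAC-SHA256 from scratch: H((K ^ opad) + H((K ^ ipad) + msg))
    if len(key) > 64:             # keys longer than the block size get hashed
        key = hashlib.sha256(key).digest()
    key = key.ljust(64, b'\x00')  # pad the key to the SHA256 block size, 64 bytes
    ipad = bytes(b ^ 0x36 for b in key)
    opad = bytes(b ^ 0x5c for b in key)
    return hashlib.sha256(opad + hashlib.sha256(ipad + msg).digest()).digest()

# sanity check against the hmac module
assert gethmac(b'password', b'salt') == hmac.new(b'password', b'salt', hashlib.sha256).digest()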

Read post

WebSocket Magic: Create a Simple Server and Client in Go and JavaScript

Title image, generated with Stable Diffusion

WebSocket is a protocol that allows for real-time, bidirectional communication between a client and a server. It is often used in web applications to enable features such as chat, live updates, and multiplayer games.

In this tutorial, I will show you how to create a minimalistic WebSocket server using Go and the nhooyr websocket library, and a JavaScript client to test it out. You will learn how to handle WebSocket connections, send and receive messages, and close the connection when necessary.

By the end of this tutorial, you will have a working WebSocket server and client that you can use as a starting point for your own WebSocket-based applications.

Setting up the project

You should first set up a simple "Hello world" Go project, something along the lines of this tutorial. Once you have a project going, let's install the nhooyr.io/websocket WebSocket library (Go's own seems deprecated, and Gorilla development ceased some years ago):

$ go get nhooyr.io/websocket

The whole system consists of a single main.go, containing a simple net/http server that will:

  1. Serve a simple WebSocket echo server at /echo
  2. Serve static files from the static subfolder – essentially all other addresses, including /, will serve content from there. We'll put index.html in that subfolder.

Basic webserver stuff:

func main() {
	address := "localhost:1234"
	http.HandleFunc("/echo", echoHandler)
	log.Printf("Starting server, go to http://%s/ to try it out!", address)
	http.Handle("/", http.FileServer(http.Dir("static")))
	err := http.ListenAndServe(address, nil)
	log.Fatal(err)
}

Now the echoHandler will do a few essential things:

  1. Upgrade the connection into a WebSocket one with websocket.Accept
  2. Log errors and defer connection close in case of errors
  3. Loop forever (or actually 10 minutes in this sample), reading messages from the socket and writing them back.

Note that I've used InsecureSkipVerify to accept connections from any origin; you might want to modify the code for a tighter policy:

Read post

Seeed Studio XIAO nRF52840 (Sense) test drive on Arduino

Seeed Studio XIAO nRF52840

I have to confess I have a thing for small prototyping boards, especially ones with Bluetooth or WLAN connectivity. So when I was offered a couple of Seeed Studio's tiny Bluetooth devboards with Nordic's nRF52840 in them to try out, I jumped at the opportunity. Full disclosure: I did not buy these myself, but neither did I get any compensation, so what follows will be rather unbiased first impressions! I will cover:

  1. The basic specifications of the two units
  2. How to (re)program the device with Arduino
  3. Help troubleshooting upload.tool.serial errors on Arduino
  4. Tips and notes on using the USB mass storage mode
  5. Initial summary

I'm interested in trying out the PDM microphone, accelerometer and BLE functionality later on, so check back for updates!

Basic specifications of the Seeed XIAO BLE nRF52840

The Seeed XIAO BLE units come in two varieties, both sharing quite beefy specs:

  • Bluetooth 5.0 with an onboard antenna
  • Nordic nRF52840, ARM Cortex-M4 32-bit processor with FPU, 64 MHz
  • Low power consumption and battery charging chip for untethered IoT use cases
  • Onboard 2 MB flash

Additionally, the Sense variant contains a PDM microphone and a 6-axis accelerometer. The units arrived from China quite quickly and came in sweet little Seeed plastic packages, pin headers included (not soldered in):

Seeed Studio XIAO nRF52840 and Sense

You can get both directly from Seeed, at very reasonable price points of $9.90 and $15.99. Nordic's chips are quite hard to source cheaply from AliExpress (yes, I have looked :), so I'd consider both pretty much a bargain.

Board quality seems very good: the pads are shiny and the components well placed. The USB port is of the modern USB-C variety, and the form factor is really small, just 20 x 17.5 mm, or the size of a nickel x dime, with the thickness of a half dollar or so (U.S. readers, you're welcome!). The PCB is single-sided, which makes it easy to embed in various configurations.

The outside difference between the basic model and the Sense variant is one additional chip that contains the PDM microphone. I think the accelerometer is hidden inside the (seemingly FCC and CE compliant) shielding.

Seeed Studio XIAO nRF52840

There is also an absurdly tiny reset button on the corner opposite the microphone pad (top left above) that is a bit tricky to press. I'd prefer a slightly larger one, but it beats shorting pins any day.

Classic blink test with Arduino

You can follow the instructions on the Seeed Studio wiki to install the necessary development tools to build firmware for the device. Short version:

  1. Get Arduino
  2. In File > Preferences, add https://files.seeedstudio.com/arduino/package_seeeduino_boards_index.json to Additional Boards Manager URLs
  3. Go to Tools > Board > Boards Manager, search for "seeed nrf52" and install the two board packages.
  4. Now you can select your board and port.

Read post