python-3.x, tensorflow2.0, huggingface-transformers

How to get text and image embedding of same dimension using Huggingface CLIP


I am using TFCLIPTextModel and TFCLIPVisionModel to get embeddings of texts and images that I need for some downstream tasks. I want the embeddings to share the same dimensional space, as they do in CLIP. However, as the documentation of these two models states, the hidden_size of TFCLIPTextModel is 512 while that of TFCLIPVisionModel is 768, so when I extract the last hidden state from the two models I get embeddings of different dimensions. I am also aware of projection_dim, which is 512 for both models, but I don't know how to extract the projected features.
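
To make the mismatch concrete, here is a quick check of the relevant config values (a throwaway snippet, not part of my pipeline):

from transformers import CLIPTextConfig, CLIPVisionConfig

text_config = CLIPTextConfig.from_pretrained("openai/clip-vit-base-patch32")
vision_config = CLIPVisionConfig.from_pretrained("openai/clip-vit-base-patch32")

print(text_config.hidden_size, text_config.projection_dim)      # 512 512
print(vision_config.hidden_size, vision_config.projection_dim)  # 768 512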

Here is my code for the embedding extraction for image and texts.

import numpy as np
import tensorflow as tf
from collections.abc import Iterable
from tqdm import tqdm, trange
from transformers import TFCLIPVisionModel, TFCLIPTextModel, CLIPProcessor, CLIPTokenizer


def Image_Embedding_Generator(images, batch_size=32):
    model_name = "openai/clip-vit-base-patch32"
    model = TFCLIPVisionModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)

    if isinstance(images, (np.ndarray, tf.Tensor)):
        images = tf.unstack(images) if len(images.shape) == 4 else [images]
    elif isinstance(images, dict):
        images = [image for _, image in images.items()]

    # inputs = processor(images = images, return_tensors = "tf", padding = True, rescaling = False)
    
    # dataset = tf.data.Dataset.from_tensor_slices((inputs['pixel_values'], inputs['attention_mask']))
    # dataset = dataset.batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)

    image_embeddings = []
    cls_embeddings = []

    for i in trange(0, len(images), batch_size, desc = "Generating Image Embeddings"):
        image_batch = images[i:i+batch_size]
        inputs = processor(images = image_batch, return_tensors = "tf", do_rescale = False)
        outputs = model(**inputs, output_hidden_states = True)
        batch_embeddings = outputs.last_hidden_state.numpy()  # (batch_size, sequence_length, 768)
        print(batch_embeddings.shape)

        # mean-pool over the token dimension
        pooled_embeddings = tf.reduce_mean(batch_embeddings, axis = 1)

        image_embeddings.append(pooled_embeddings)
        cls_embeddings.append(outputs.pooler_output)
    image_embeddings = np.concatenate(image_embeddings, axis=0)
    cls_embeddings = np.concatenate(cls_embeddings, axis=0)

    return cls_embeddings, image_embeddings
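
For reference, a quick shape check with placeholder inputs (random arrays already scaled to [0, 1], since do_rescale = False is passed to the processor) looks like this:

dummy_images = np.random.rand(4, 224, 224, 3).astype("float32")  # placeholder data
cls_emb, mean_emb = Image_Embedding_Generator(dummy_images, batch_size=2)
print(cls_emb.shape, mean_emb.shape)  # (4, 768) (4, 768)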


def Text_Embedding_Generator(texts, batch_size = 32):
    model_name = "openai/clip-vit-base-patch32"
    model = TFCLIPTextModel.from_pretrained(model_name)
    tokenizer = CLIPTokenizer.from_pretrained(model_name)

    if isinstance(texts, str):
        texts = [texts]
    elif isinstance(texts, dict):
        texts = [text for _, text in texts.items()]
    elif isinstance(texts, Iterable):
        texts = list(texts)

    inputs = tokenizer(text = texts, return_tensors = "tf", padding = "max_length", truncation = True, max_length = 77)  # CLIP's text encoder supports at most 77 tokens

    dataset = tf.data.Dataset.from_tensor_slices((inputs['input_ids'], inputs['attention_mask']))
    dataset = dataset.batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)

    text_embeddings = []
    cls_embeddings = []

    for batch in tqdm(dataset, desc = "Generating Text Embeddings"):
        batch_inputs = {'input_ids': batch[0], 'attention_mask': batch[1]}
        outputs = model(**batch_inputs)
        batch_embeddings = outputs.last_hidden_state.numpy()  # Convert to numpy array

        attention_mask_expanded = tf.cast(batch_inputs['attention_mask'], dtype=tf.float32)[:, :, None]
        sum_embeddings = tf.reduce_sum(batch_embeddings * attention_mask_expanded, axis=1)
        sum_mask = tf.reduce_sum(attention_mask_expanded, axis=1)
        pooled_embeddings = sum_embeddings / sum_mask

        text_embeddings.append(pooled_embeddings)
        cls_embeddings.append(outputs.pooler_output)
    
    text_embeddings = np.concatenate(text_embeddings, axis=0)
    cls_embeddings = np.concatenate(cls_embeddings, axis=0)

    return cls_embeddings, text_embeddings
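
And the corresponding check for the text side (the two captions are placeholders):

dummy_texts = ["a photo of a cat", "a photo of a dog"]
cls_emb, mean_emb = Text_Embedding_Generator(dummy_texts, batch_size=2)
print(cls_emb.shape, mean_emb.shape)  # (2, 512) (2, 512)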

As said earlier, the cls_embeddings and text_embeddings from Text_Embedding_Generator have shape (batch_size, 512), while those from Image_Embedding_Generator have shape (batch_size, 768).

Is there a way to get these two embeddings to the same dimension without needing to train an extra layer on top of them?


Solution

  • I figured out the intricacies. For TensorFlow users who are wondering how to get the image and text embeddings to the same dimension without needing to train an extra layer on top of the pre-trained CLIP, here is the way to go.

    Instead of using the text and vision models separately, which project the image and text into different dimensions, we are going to use the combined TFCLIPModel, which embeds the text and image into the same dimension. Following this link, and the example underneath it, I came up with the following minimal code to get the embeddings into the same dimension.

    import tensorflow as tf
    from PIL import Image
    import requests
    from transformers import AutoProcessor, TFCLIPModel
    
    model = TFCLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
    
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    
    inputs = processor(
        text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="tf", padding=True
    )
    
    outputs = model(**inputs)
    

    Now printing outputs.keys() yields:

    dict_keys(['logits_per_image', 'logits_per_text', 'text_embeds', 'image_embeds', 'text_model_output', 'vision_model_output'])
    

    I will explain each of the above outputs below

    (1) logits_per_image: This is the image-text similarity score, which is essentially a cosine similarity scaled by a temperature factor (usually in the range 0.07 to 0.1). The shape of this is (image batch size x text batch size).

    (2) logits_per_text: This is the text-image similarity score, i.e., the transpose of logits_per_image. The shape of this is (text batch size x image batch size).

    (3) text_embeds: This is the text embedding projected to some dimension d. For pre-trained CLIP, d is 512.

    (4) image_embeds: This is the image embedding projected to the same dimension as the text embedding, i.e., d which is 512 for pre-trained CLIP.

    (5) text_model_output: This is the output of the underlying text model. So while I was calling TFCLIPTextModel.from_pretrained(...) in my Text_Embedding_Generator, I was getting only the raw text-model output rather than the projected embedding. For a pre-trained CLIP, the text model output has dimension 512.

    (6) vision_model_output: Similar to the above, but this is the output of the underlying vision model, which in my case is a ViT, so this output has dimension 768.

    So, to summarize, image_embeds and text_embeds are what we are looking for, and for the lazy readers:

    In order to get the embeddings in the same dimension, use TFCLIPModel and NOT the separate Vision/Text models.

    In order to get the vision/text outputs ONLY, one may use the Vision and Text models individually, although I don't see a reason to, since the combined model gives them as well. A short sketch of working with these outputs is below.
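
    Here is that sketch, continuing from the outputs object computed above (one image, two captions). The comment about normalization is my reading of the library code, so treat it as an assumption rather than documented behaviour.

    # Continues from the snippet above: 1 image, 2 texts
    text_embeds = outputs.text_embeds      # shape (2, 512)
    image_embeds = outputs.image_embeds    # shape (1, 512)
    print(text_embeds.shape, image_embeds.shape)

    # As far as I can tell these come back L2-normalized, so a plain matmul
    # already gives the cosine similarity; the logits are this similarity
    # scaled by the learned temperature.
    cosine_sim = tf.matmul(image_embeds, text_embeds, transpose_b=True)  # (1, 2)

    # Image-text matching probabilities, as in the official example
    probs = tf.nn.softmax(outputs.logits_per_image, axis=1)              # (1, 2)
    print(cosine_sim, probs)

    # If only one modality is needed, TFCLIPModel also exposes
    # get_text_features / get_image_features, which return the projected
    # (but, I believe, un-normalized) 512-dimensional features directly.
    text_features = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )                                                                    # (2, 512)
    image_features = model.get_image_features(pixel_values=inputs["pixel_values"])  # (1, 512)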