I am attempting to understand more about computer vision models, and I'm trying to do some exploring of how they work. In an attempt to understand how to interpret feature vectors more I'm trying to use Pytorch to extract a feature vector. Below is my code that I've pieced together from various places.
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as transforms
from torch.autograd import Variable
from PIL import Image
img=Image.open("Documents/01235.png")
# Load the pretrained model
model = models.resnet18(pretrained=True)
# Use the model object to select the desired layer
layer = model._modules.get('avgpool')
# Set model to evaluation mode
model.eval()
transforms = torchvision.transforms.Compose([
torchvision.transforms.Resize(256),
torchvision.transforms.CenterCrop(224),
torchvision.transforms.ToTensor(),
torchvision.transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
def get_vector(image_name):
# Load the image with Pillow library
img = Image.open("Documents/Documents/Driven Data Competitions/Hateful Memes Identification/data/01235.png")
# Create a PyTorch Variable with the transformed image
t_img = transforms(img)
# Create a vector of zeros that will hold our feature vector
# The 'avgpool' layer has an output size of 512
my_embedding = torch.zeros(512)
# Define a function that will copy the output of a layer
def copy_data(m, i, o):
my_embedding.copy_(o.data)
# Attach that function to our selected layer
h = layer.register_forward_hook(copy_data)
# Run the model on our transformed image
model(t_img)
# Detach our copy function from the layer
h.remove()
# Return the feature vector
return my_embedding
pic_vector = get_vector(img)
When I do this I get the following error:
RuntimeError: Expected 4-dimensional input for 4-dimensional weight [64, 3, 7, 7], but got 3-dimensional input of size [3, 224, 224] instead
I'm sure this is an elementary error, but I can't seem to figure out how to fix this. It was my impression that the "totensor" transformation would make my data 4-d, but it seems it's either not working correctly or I'm misunderstanding it. Appreciate any help or resources I can use to learn more about this!
All the default nn.Modules
in pytorch expect an additional batch dimension. If the input to a module is shape (B, ...) then the output will be (B, ...) as well (though the later dimensions may change depending on the layer). This behavior allows efficient inference on batches of B inputs simultaneously. To make your code conform you can just unsqueeze
an additional unitary dimension onto the front of t_img
tensor before sending it into your model to make it a (1, ...) tensor. You will also need to flatten
the output of layer
before storing it if you want to copy it into your one-dimensional my_embedding
tensor.
A couple of other things:
You should infer within a torch.no_grad()
context to avoid computing gradients since you won't be needing them (note that model.eval()
just changes the behavior of certain layers like dropout and batch normalization, it doesn't disable construction of the computation graph, but torch.no_grad()
does).
I assume this is just a copy paste issue but transforms
is the name of an imported module as well as a global variable.
o.data
is just returning a copy of o
. In the old Variable
interface (circa PyTorch 0.3.1 and earlier) this used to be necessary, but the Variable
interface was deprecated way back in PyTorch 0.4.0 and no longer does anything useful; now its use just creates confusion. Unfortunately, many tutorials are still being written using this old and unnecessary interface.
Updated code is then as follows:
import torch
import torchvision
import torchvision.models as models
from PIL import Image
img = Image.open("Documents/01235.png")
# Load the pretrained model
model = models.resnet18(pretrained=True)
# Use the model object to select the desired layer
layer = model._modules.get('avgpool')
# Set model to evaluation mode
model.eval()
transforms = torchvision.transforms.Compose([
torchvision.transforms.Resize(256),
torchvision.transforms.CenterCrop(224),
torchvision.transforms.ToTensor(),
torchvision.transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
def get_vector(image):
# Create a PyTorch tensor with the transformed image
t_img = transforms(image)
# Create a vector of zeros that will hold our feature vector
# The 'avgpool' layer has an output size of 512
my_embedding = torch.zeros(512)
# Define a function that will copy the output of a layer
def copy_data(m, i, o):
my_embedding.copy_(o.flatten()) # <-- flatten
# Attach that function to our selected layer
h = layer.register_forward_hook(copy_data)
# Run the model on our transformed image
with torch.no_grad(): # <-- no_grad context
model(t_img.unsqueeze(0)) # <-- unsqueeze
# Detach our copy function from the layer
h.remove()
# Return the feature vector
return my_embedding
pic_vector = get_vector(img)