machine-learning deep-learning pytorch ocr resnet

Using ResNet50 to create a feature tensor of [w, h, f]


I'm trying to implement this paper but I'm not following something in it.

It wants me to use ResNet50 to extract features from an image, but tells me the extracted features will have shape [w, h, f]. Everything I'm seeing with ResNet50, though, gives me back a tensor of shape [f] (as in, it turns my whole image into one feature vector instead of giving me features per spatial location).

Am I reading this wrong or do I just not understand what I'm supposed to be doing with ResNet50?

Relevant quotes from paper: "We obtain an intermediate visual feature representation Fc of size f. We use the ResNet50 [26] as our backbone convolutional architecture."

"In a first step, the three-dimensional feature Fc is reshaped into a two-dimensional feature by keeping its width, i.e. obtaining a feature shape (f × h, w)."


Solution

  • I didn't read the paper in detail, but when they say [w, h, f] I don't think w and h have to match the width and height of the original image. They likely just mean that if the output of your ResNet's last convolutional stage (before the global pooling and fully-connected layers) is [w, h, f], you reshape it into 2D (making it [f×h, w]) and then pass it through a fully-connected layer to collapse it into a single feature vector.

    Something like this:

    import torch
    import torch.nn as nn
    import torchvision.models as models
    
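    # `pretrained=True` is deprecated since torchvision 0.13; use `weights` instead
    resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    
    # Drop the final adaptive average pooling and fully connected layers
    # so the network returns the spatial feature map instead of class logits
    resnet = nn.Sequential(*list(resnet.children())[:-2])
    resnet.eval()  # use the BatchNorm running statistics for feature extraction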
    
    # Dummy image of shape [1, 3, 224, 224]
    image = torch.randn(1, 3, 224, 224)
    
    intermediate_features = resnet(image)  # This will be [1, 2048, 7, 7]
    
    batch_size, channels, h, w = intermediate_features.size()
    
    # [1, 14336, 7], where f * h = 2048 * 7 = 14336 and w = 7
    reshaped_features = intermediate_features.view(batch_size, channels * h, w)
    
    fc_layer = nn.Linear(w, 1)  # This layer reduces the w dimension to 1
    
    final_output = fc_layer(reshaped_features)  # [1, 14336, 1]
    
    final_output = final_output.squeeze(-1)  # [1, 14336]
    
    print(final_output.shape)
    
    

    (My example also has the batch size as a dimension, because in practice you work with batches of examples.)
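
To see why h and w in the feature map are not the image's height and width, here is a minimal sketch in plain Python that applies the standard convolution output-size formula to the five stride-2 steps of torchvision's ResNet50 (conv1, maxpool, layer2, layer3, layer4). The helper names `conv_out` and `resnet50_feature_shape` and the 64×512 input are made up for illustration; they are not from the paper.

```python
def conv_out(n, kernel, stride, padding):
    """Output length of a conv/pool layer along one axis (floor mode, dilation 1)."""
    return (n + 2 * padding - kernel) // stride + 1


def resnet50_feature_shape(height, width, channels=2048):
    """Shape (f, h, w) of the feature map before avgpool/fc in torchvision's ResNet50."""
    def downsample(n):
        n = conv_out(n, kernel=7, stride=2, padding=3)  # conv1
        n = conv_out(n, kernel=3, stride=2, padding=1)  # maxpool
        # layer1 keeps the resolution; layer2, layer3 and layer4 each halve it
        for _ in range(3):
            n = conv_out(n, kernel=3, stride=2, padding=1)
        return n

    return channels, downsample(height), downsample(width)


for size in [(224, 224), (64, 512)]:  # (height, width); the second is a hypothetical text-line image
    f, h, w = resnet50_feature_shape(*size)
    print(size, "->", (f, h, w), "reshaped to", (f * h, w))
```

A 224×224 input gives (2048, 7, 7), matching the snippet above, while a wide 64×512 input gives (2048, 2, 16). So the w axis that the reshape keeps is roughly the image width divided by 32, which is presumably what the paper means by "keeping its width".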