python machine-learning deep-learning pytorch transformer-model

vision transformers: RuntimeError: mat1 and mat2 shapes cannot be multiplied (32x1000 and 768x32)


I am trying to do regression with a Vision Transformer model, but I cannot figure out how to replace the final classification layer with a regression layer.

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision.models import vit_b_16

batch_size = 32

class RegressionViT(nn.Module):
    def __init__(self, in_features=224 * 224 * 3, num_classes=1, pretrained=True):
        super(RegressionViT, self).__init__()
        self.vit_b_16 = vit_b_16(pretrained=pretrained)
        # Accessing the actual output feature size from vit_b_16
        self.regressor = nn.Linear(self.vit_b_16.heads[0].in_features, num_classes * batch_size)

    def forward(self, x):
        x = self.vit_b_16(x)
        x = self.regressor(x)
        return x


# Model
model = RegressionViT(num_classes=1)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

criterion = nn.MSELoss()  # Use appropriate loss function for regression
optimizer = optim.Adam(model.parameters(), lr=0.0001)

I get this error when I initialize the model and run it on a batch:

RuntimeError: mat1 and mat2 shapes cannot be multiplied (32x1000 and 768x32)

The problem is a shape mismatch between my regression layer and the output of the vit_b_16 model. What would be the correct way to solve this?


Solution

  • If you look into the source code of torchvision's VisionTransformer, you will notice that self.heads is a sequential module (nn.Sequential), not a linear layer. By default it contains a single linear layer named head, the final classification layer that maps the 768-dimensional encoder features to 1000 class logits. In your code that head is still in place, so self.vit_b_16(x) already returns a (batch_size, 1000) tensor, and feeding it into a Linear that expects 768 inputs is exactly what produces the (32x1000) and (768x32) mismatch. Instead of stacking an extra regressor on top, overwrite the existing head:

    heads = self.vit_b_16.heads
    heads.head = nn.Linear(heads.head.in_features, num_classes)
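
    For completeness, here is a minimal sketch of how the whole model could look with the head replaced (assuming torchvision's vit_b_16 and a single regression target; the dummy batch at the end is only there to check the output shape):

    import torch
    import torch.nn as nn
    from torchvision.models import vit_b_16

    class RegressionViT(nn.Module):
        def __init__(self, num_classes=1, pretrained=True):
            super().__init__()
            self.vit_b_16 = vit_b_16(pretrained=pretrained)
            # Replace the 768 -> 1000 classification head with a 768 -> num_classes
            # regression head instead of stacking another linear layer on top of it
            heads = self.vit_b_16.heads
            heads.head = nn.Linear(heads.head.in_features, num_classes)

        def forward(self, x):
            return self.vit_b_16(x)

    # Quick shape check with a dummy batch of 32 images
    model = RegressionViT(num_classes=1)
    out = model(torch.randn(32, 3, 224, 224))
    print(out.shape)  # torch.Size([32, 1])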