I am using a CLIP model for a labelling task. When I call the model's text encoder, I get the following error:
<ipython-input-117-4c513cc2d787> in forward(self, batch)
34 print(y.size())
35 print(y.dim())
---> 36 y = self.text_encoder(y)
37 y = self.classifier(y)
38
/usr/local/lib/python3.10/dist-packages/clip/model.py in encode_text(self, text)
345 x = x + self.positional_embedding.type(self.dtype)
346 x = x.permute(1, 0, 2) # NLD -> LND
--> 347 x = self.transformer(x)
348 x = x.permute(1, 0, 2) # LND -> NLD
349 x = self.ln_final(x).type(self.dtype)
RuntimeError: permute(sparse_coo): number of dimensions in the tensor input does not match the length of the desired ordering of dimensions i.e. input.dim() = 4 is not equal to len(dims) = 3
Each image can have multiple labels, so I use a collate_fn with pad_sequence in the DataLoader before feeding the batch into the model.
def pad_sequence(batch):
    return torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=0)

def my_collate_fn(batch):
    batch['i'] = torch.stack(batch['i'].float())
    batch['l'] = pad_sequence(batch['l'].long())
    return batch

class CustomCLIP(torch.nn.Module):
    def __init__(self, num_classes: int = 10, bias=False):
        super().__init__()
        #model, _ = clip.load("RN50")

    def forward(self, batch):
        x = batch['i']
        x = self.encoder(x)
        x = self.classifier(x)
        y = batch['l']
        print(y)
        print(y.size())
        print(y.dim())
        y = self.text_encoder(y) #error on this line
        y = self.classifier(y)
        x_similarity = x @ x.T
        y_similarity = y @ y.T
        targets = F.softmax(
            (x_similarity + y_similarity) / 2 * temperature, dim=-1
        )
        outputs = (y @ x.T) / temperature
        return outputs, targets
I printed the number of dimensions of y. It is 3, which matches the length of the permute ordering. So why does the error say the input tensor has 4 dimensions?
[[49406, 332, 49407, ..., 0, 0, 0],
[49406, 320, 49407, ..., 0, 0, 0],
[49406, 333, 49407, ..., 0, 0, 0],
...,
[ 0, 0, 0, ..., 0, 0, 0],
[ 0, 0, 0, ..., 0, 0, 0],
[ 0, 0, 0, ..., 0, 0, 0]]], device='cuda:0')
torch.Size([32, 392, 77])
3
Could someone please point out what the issue is and how to solve it? Thanks in advance.
I solved it by calling squeeze() on the tensor so that it has 3 dimensions, which is the number the permute expects. When I checked the shape of the input at that point, it actually had 4 dimensions.
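
A likely explanation for the 3 versus 4 mismatch, assuming the text encoder starts with a token embedding lookup (as the stock encode_text in the clip package does): the embedding appends one dimension, so a 3-D token tensor of shape [batch, labels, 77] is already 4-D by the time it reaches permute(1, 0, 2). The sketch below reproduces this with a stand-in embedding layer. The sizes, the token_embedding layer, and the embed_and_permute helper are hypothetical names for illustration, not part of the original model.

```python
import torch
import torch.nn as nn

# Hypothetical small sizes standing in for the real ones (32 images, 392 labels, width 512).
BATCH, MAX_LABELS, CONTEXT, WIDTH = 2, 3, 77, 8
VOCAB = 49408  # large enough to hold the token ids seen above (up to 49407)

# Stand-in for the text encoder's token embedding (an assumption, not the real weights).
token_embedding = nn.Embedding(VOCAB, WIDTH)


def embed_and_permute(tokens: torch.Tensor) -> torch.Tensor:
    """Mimic the first steps of a text encoder: embed, then NLD -> LND."""
    x = token_embedding(tokens)   # appends a trailing WIDTH dimension
    return x.permute(1, 0, 2)     # only valid when x has exactly 3 dimensions


# Padded labels shaped like the printed tensor: [batch, labels, context].
tokens = torch.zeros(BATCH, MAX_LABELS, CONTEXT, dtype=torch.long)
print(tokens.dim())                    # 3
print(token_embedding(tokens).dim())   # 4

try:
    embed_and_permute(tokens)
except RuntimeError as err:
    print("permute failed:", err)

# Option 1: one label per image, so the middle dimension is a singleton and can be squeezed.
single = torch.zeros(BATCH, 1, CONTEXT, dtype=torch.long)
print(embed_and_permute(single.squeeze(1)).shape)   # torch.Size([77, 2, 8])

# Option 2: several labels per image, so fold the label dimension into the batch.
flat = tokens.reshape(-1, CONTEXT)                   # [batch * labels, context]
print(embed_and_permute(flat).shape)                 # torch.Size([77, 6, 8])
```

squeeze() only removes dimensions of size 1, so it fits the single-label case. With several padded labels per image, reshaping to [batch * labels, 77] before encoding and reshaping the result back to [batch, labels, -1] afterwards is one possible alternative. The all-zero padding rows would probably still need to be masked out.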