Good morning everyone
Below is my implementation of a PyTorch Siamese network. I am using a batch size of 32, MSE loss, and SGD with momentum 0.9 as the optimizer.
class SiameseCNN(nn.Module):
    def __init__(self):
        super(SiameseCNN, self).__init__()                  # 1, 40, 50
        self.convnet = nn.Sequential(
            nn.Conv2d(1, 8, 7), nn.ReLU(),                  # 8, 34, 44
            nn.Conv2d(8, 16, 5), nn.ReLU(),                 # 16, 30, 40
            nn.MaxPool2d(2, 2),                             # 16, 15, 20
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),     # 32, 15, 20
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())     # 64, 15, 20
        self.linear1 = nn.Sequential(nn.Linear(64 * 15 * 20, 100), nn.ReLU())
        self.linear2 = nn.Sequential(nn.Linear(100, 2), nn.ReLU())

    def forward(self, data):
        res = []
        for j in range(2):
            x = self.convnet(data[:, j, :, :])
            x = x.view(-1, 64 * 15 * 20)
            res.append(self.linear1(x))
        fres = abs(res[1] - res[0])
        return self.linear2(fres)
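A quick smoke test of the forward pass, assuming the pair tensor is laid out as (batch, 2, 1, 40, 50) so that data[:, j, :, :] keeps a channel dimension for the first Conv2d (this layout is my assumption, not something confirmed above):

import torch

model = SiameseCNN()
dummy = torch.randn(32, 2, 1, 40, 50)   # 32 pairs of single-channel 40x50 images (assumed layout)
out = model(dummy)
print(out.shape)                         # torch.Size([32, 2])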
Each batch contains alternating pairs, i.e. [pos, pos], [pos, neg], [pos, pos], etc. However, the network doesn't converge, and the problem seems to be that fres in the network is the same for each pair (regardless of whether it is a positive or negative pair), and the output of self.linear2(fres) is always approximately equal to [0.0531, 0.0770]. This is in contrast with what I am expecting, which is that the first value of [0.0531, 0.0770] would get closer to 1 for a positive pair as the network learns, and the second value would get closer to 1 for a negative pair. These two values also need to sum up to 1.
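For clarity, the sum-to-1 behaviour I am expecting is what a softmax over the two outputs would produce; a minimal standalone sketch of what I mean (the numbers are made up, and this is separate from the model above, which currently ends in a ReLU):

import torch
import torch.nn.functional as F

logits = torch.tensor([[0.3, -1.2]])   # hypothetical raw scores for one pair
probs = F.softmax(logits, dim=1)       # two non-negative values that sum to 1
print(probs, probs.sum(dim=1))         # roughly tensor([[0.8176, 0.1824]]) tensor([1.])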
I have tested exactly the same setup and the same input images with a 2-channel network architecture, where, instead of feeding in [pos, pos], you stack those two images depth-wise, for example numpy.stack([pos, pos], -1). The first convolution also changes from nn.Conv2d(1, 8, 7) to nn.Conv2d(2, 8, 7) in this setup. This works perfectly fine.
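For completeness, a minimal sketch of how one such 2-channel input can be assembled (the image variables are hypothetical; I stack on axis 0 here so the result is already channels-first for PyTorch, whereas the np.stack([pos, pos], -1) form would still need a transpose before nn.Conv2d(2, 8, 7)):

import numpy as np
import torch

# Hypothetical 40x50 grey-scale images forming one pair
img_a = np.random.rand(40, 50).astype(np.float32)
img_b = np.random.rand(40, 50).astype(np.float32)

pair = np.stack([img_a, img_b], axis=0)        # (2, 40, 50), depth-wise stack, channels-first
batch = torch.from_numpy(pair).unsqueeze(0)    # (1, 2, 40, 50), ready for nn.Conv2d(2, 8, 7)
print(batch.shape)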
I have also tested exactly the same setup and input images with a traditional CNN approach, where I just pass single positive and negative greyscale images into the network, instead of stacking them (as with the 2-channel approach) or passing them in as image pairs (as with the Siamese approach). This also works perfectly, but the results are not as good as with the 2-channel approach.
EDIT (Solutions I've tried):
def forward(self, data):
    res = []
    for j in range(2):
        x = self.convnet(data[:, j, :, :])
        x = x.view(-1, 64 * 15 * 20)
        res.append(x)
    fres = self.linear2(self.linear1(abs(res[1] - res[0])))
    return fres
I also tried torch.nn.PairwiseDistance between the outputs of convnet. This made some sort of improvement; the network starts to converge for the first few epochs, and then hits the same plateau every time:

def forward(self, data):
    res = []
    for j in range(2):
        x = self.convnet(data[:, j, :, :])
        res.append(x)
    pdist = nn.PairwiseDistance(p=2)
    diff = pdist(res[1], res[0])
    diff = diff.view(-1, 64 * 15 * 10)
    fres = self.linear2(self.linear1(diff))
    return fres
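Since linear1 expects a flat 64 * 15 * 20 input, the view size above has to match whatever shape pdist actually produces; a quick standalone way to check that (the tensor sizes are just placeholders matching the convnet output):

import torch
import torch.nn as nn

a = torch.randn(32, 64, 15, 20)   # placeholder convnet outputs for the two branches
b = torch.randn(32, 64, 15, 20)
pdist = nn.PairwiseDistance(p=2)
print(pdist(a, b).shape)          # shows which dimension gets reduced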
Perhaps another thing to note is that, within the context of my research, a Siamese network is trained for each object. So the first class is associated with images containing the object in question, and the second class is associated with images containing other objects. I don't know if this might be the cause of the problem. It is, however, not a problem within the context of the traditional CNN and 2-channel CNN approaches.
As per request, here is my training code:
model = SiameseCNN().cuda()
ls_fn = torch.nn.BCELoss()
optim = torch.optim.SGD(model.parameters(), lr=1e-6, momentum=0.9)
epochs = np.arange(100)
eloss = []
for epoch in epochs:
    model.train()
    train_loss = []
    for x_batch, y_batch in dp.train_set:
        x_var, y_var = Variable(x_batch.cuda()), Variable(y_batch.cuda())
        y_pred = model(x_var)
        loss = ls_fn(y_pred, y_var)
        train_loss.append(abs(loss.item()))
        optim.zero_grad()
        loss.backward()
        optim.step()
    eloss.append(np.mean(train_loss))
    print(epoch, np.mean(train_loss))
Note that dp in dp.train_set is a class with attributes train_set, valid_set, test_set, where each set is created as follows:

DataLoader(TensorDataset(torch.Tensor(x), torch.Tensor(y)), batch_size=bs)
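As a standalone illustration of that construction (all shapes here are my assumptions: N pairs stored as (N, 2, 1, 40, 50) with one target per pair; the real label layout depends on the head and loss being used):

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

bs = 32
x = np.zeros((1000, 2, 1, 40, 50), dtype=np.float32)   # hypothetical: 1000 image pairs
y = np.zeros((1000, 1), dtype=np.float32)              # hypothetical: one target per pair
train_set = DataLoader(TensorDataset(torch.Tensor(x), torch.Tensor(y)), batch_size=bs)
print(len(train_set))                                   # number of batches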
As per request, here is an example of the predicted probabilities vs true label, where you can see the model doesn't seem to be learning:
Predicted: 0.5030623078346252 Label: 1.0
Predicted: 0.5030624270439148 Label: 0.0
Predicted: 0.5030624270439148 Label: 1.0
Predicted: 0.5030625462532043 Label: 0.0
Predicted: 0.5030625462532043 Label: 1.0
Predicted: 0.5030626654624939 Label: 0.0
Predicted: 0.5030626058578491 Label: 1.0
Predicted: 0.5030627250671387 Label: 0.0
Predicted: 0.5030626654624939 Label: 1.0
Predicted: 0.5030627846717834 Label: 0.0
Predicted: 0.5030627250671387 Label: 1.0
Predicted: 0.5030627846717834 Label: 0.0
Predicted: 0.5030627250671387 Label: 1.0
Predicted: 0.5030628442764282 Label: 0.0
Predicted: 0.5030627846717834 Label: 1.0
Predicted: 0.5030628442764282 Label: 0.0
Problem solved. It turns out the network will predict the same output every time if you give it the same images every time 🤣 It was a small indexing mistake on my part during data partitioning. Thanks for everyone's help and assistance. Here is an example of the convergence as it is now:
0 0.20198837077617646
1 0.17636818194389342
2 0.15786472541093827
3 0.1412761415243149
4 0.126698794901371
5 0.11397973036766053
6 0.10332610329985618
7 0.09474560652673245
8 0.08779258838295936
9 0.08199785630404949
10 0.07704121413826942
11 0.07276330365240574
12 0.06907484836131335
13 0.06584368328005076
14 0.06295975042134523
15 0.06039590438082814
16 0.058096024941653016
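For anyone who hits the same symptom (identical predictions for every pair): the bug was that the data pipeline kept serving the same images, and a small sanity check along these lines would have caught it early (just a sketch, using the dp.train_set loader from above):

import torch

first_batch = next(iter(dp.train_set))[0]
for x_batch, _ in dp.train_set:
    if x_batch.shape != first_batch.shape or not torch.equal(x_batch, first_batch):
        print("Batches differ -- data pipeline looks fine")
        break
else:
    print("Every batch is identical -- check the partitioning/indexing")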