I have spent a week trying to solve a problem with a Siamese PyTorch model and will explain everything in detail. The model takes two anime face images as input and should say whether they are similar. Here are 10 random images for better understanding:
I used Adam as the optimizer (I also tried SGD), a contrastive loss function, a StepLR scheduler, different batch sizes (no effect), different learning rates, and early stopping.
I tried ResNet34 and VGG16 as base models, but it didn't change the result: the model didn't converge.
During training the loss decreased rapidly at first, but then it plateaued around 1 and stopped decreasing. If I stop training and resume with a lower learning rate, it still doesn't decrease.
The dataset is large: over 200 different characters with at least 80 images per character (14,748 images in total). From these I made 14,703 similar and 14,734 dissimilar pairs, so that no image appears in more than 2 pairs.
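The pairing logic looks roughly like this (a simplified sketch: images_by_character is a placeholder mapping each character to its image paths, and the real code additionally caps each image at 2 uses):

import random

# Sketch of pair construction: label 0 = similar pair, label 1 = dissimilar pair
pairs = []
characters = list(images_by_character)
for character in characters:
    images = images_by_character[character]
    # similar pair: two different images of the same character
    a, b = random.sample(images, 2)
    pairs.append((a, b, 0))
    # dissimilar pair: one image of this character, one of a different character
    other = random.choice([c for c in characters if c != character])
    pairs.append((a, random.choice(images_by_character[other]), 1))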
These are my transforms (I tried adding normalization, but it had no effect on the result):
images_size = 224

transform = transforms.Compose([
    transforms.Resize((images_size, images_size)),
    transforms.Lambda(lambda x: x.convert('RGB')),  # convert any mode (e.g. RGBA) to 3-channel RGB
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation((-20, 20)),
    transforms.ToTensor(),
])
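For reference, the normalization I experimented with would look like this (a sketch using the standard ImageNet statistics, which pretrained torchvision backbones expect; it goes after ToTensor()):

from torchvision import transforms

# Sketch: ImageNet normalization, appended after ToTensor()
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])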
I thought the problem might be the dropout layers in the VGG16 classifier, so I removed them, but it didn't help. There were also 1000 neurons in the last FC layer, so I decreased that to 100; that didn't help either. My model is now:
class SiameseNetwork(nn.Module):
    def __init__(self):
        super(SiameseNetwork, self).__init__()
        self.base_model = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
        # Replaced classifier: dropout removed, embedding size reduced to 100
        self.base_model.classifier = nn.Sequential(
            nn.Linear(in_features=25088, out_features=4096, bias=True),
            nn.ReLU(inplace=True),
            nn.Linear(in_features=4096, out_features=1000, bias=True),
            nn.ReLU(inplace=True),
            nn.Linear(in_features=1000, out_features=100, bias=True)
        )

    def forward_once(self, x):
        # Pass one image through the base model
        return self.base_model(x)

    def forward(self, image1, image2):
        # Get the feature vectors for both images
        output1 = self.forward_once(image1)
        output2 = self.forward_once(image2)
        return output1, output2

model = SiameseNetwork()
model.to(device)
This is my loss function:
class ContrastiveLoss(nn.Module):
    def __init__(self, margin=2.0):
        super(ContrastiveLoss, self).__init__()
        self.margin = margin

    def forward(self, output1, output2, label):
        euclidean_distance = F.pairwise_distance(output1, output2, keepdim=True)
        loss_contrastive = torch.mean((1 - label) * torch.pow(euclidean_distance, 2) +
                                      label * torch.pow(torch.clamp(self.margin - euclidean_distance, min=0.0), 2))
        return loss_contrastive

criterion = ContrastiveLoss()
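For completeness, my training loop looks roughly like this (a simplified sketch: train_loader and num_epochs are placeholders, and the learning rate, step_size, and patience values shown are illustrative, not my exact settings):

from torch import optim

optimizer = optim.Adam(model.parameters(), lr=1e-4)                       # also tried SGD
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)  # StepLR scheduler

best_loss, patience, bad_epochs = float('inf'), 5, 0
for epoch in range(num_epochs):
    epoch_loss = 0.0
    for img1, img2, label in train_loader:  # loader yields (image1, image2, label)
        img1, img2, label = img1.to(device), img2.to(device), label.to(device)
        optimizer.zero_grad()
        out1, out2 = model(img1, img2)
        loss = criterion(out1, out2, label)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    scheduler.step()
    # early stopping on the epoch loss
    if epoch_loss < best_loss:
        best_loss, bad_epochs = epoch_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break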
I tried different margins. I noticed that with a lower margin (e.g. 0.5) the loss becomes very small (below 1), but the output distances also become very small.
Of course, I thought that some anime faces might simply be very similar to each other, so I tried to overfit the model. Instead of the large anime face dataset, I used this small dataset on Kaggle, which contains faces of 5 Avengers actors (274 images).
During training the loss again stopped decreasing at around 1; I had the same problem.
After training I tested it on 20 random pairs from the dataset. Here you can see the distances and predictions (I set the threshold to 0.6; a pair is predicted similar, True, when its distance is below the threshold):
tensor([0.6290], device='cuda:0') tensor([False], device='cuda:0')
tensor([0.3255], device='cuda:0') tensor([True], device='cuda:0')
tensor([1.1670], device='cuda:0') tensor([False], device='cuda:0')
tensor([1.2781], device='cuda:0') tensor([False], device='cuda:0')
tensor([0.8160], device='cuda:0') tensor([False], device='cuda:0')
tensor([0.9014], device='cuda:0') tensor([False], device='cuda:0')
tensor([0.7134], device='cuda:0') tensor([False], device='cuda:0')
tensor([0.7407], device='cuda:0') tensor([False], device='cuda:0')
tensor([0.8605], device='cuda:0') tensor([False], device='cuda:0')
tensor([0.5967], device='cuda:0') tensor([True], device='cuda:0')
tensor([1.2125], device='cuda:0') tensor([False], device='cuda:0')
tensor([0.9514], device='cuda:0') tensor([False], device='cuda:0')
tensor([0.7146], device='cuda:0') tensor([False], device='cuda:0')
tensor([0.6037], device='cuda:0') tensor([False], device='cuda:0')
tensor([0.7502], device='cuda:0') tensor([False], device='cuda:0')
tensor([0.6506], device='cuda:0') tensor([False], device='cuda:0')
tensor([0.9127], device='cuda:0') tensor([False], device='cuda:0')
tensor([0.8530], device='cuda:0') tensor([False], device='cuda:0')
tensor([0.6156], device='cuda:0') tensor([False], device='cuda:0')
tensor([0.8985], device='cuda:0') tensor([False], device='cuda:0')
Correct: 10, Wrong: 10, Accuracy: 50.0000
And the images (GT = ground truth label, P = model prediction, with distances):
Please explain what I am doing wrong. Feel free to ask questions in the comments.
Also, there is one thing I don't understand: if two images are similar, should I label the pair 0 and dissimilar pairs 1, or vice versa? I tried both, but it didn't solve the problem.
The number of output features may be too low; I would try 256 instead of 100. 256 worked well for me with a triplet margin loss for facial recognition. However, since it did not work with 1000 either, the issue may very well be somewhere else.
Converting your code to math:
loss_contrastive = torch.mean((1 - label) * torch.pow(euclidean_distance, 2) +
                              (label) * torch.pow(torch.clamp(self.margin - euclidean_distance, min=0.0), 2))
loss = (1 - Y) * d^2 + Y * max(margin - d, 0)^2
With this loss function, Y = 0 for pairs of the same class, so that d^2 is minimized, and Y = 1 for pairs of different classes, so that max(margin - d, 0)^2 is minimized. In other words: similar pairs get label 0, dissimilar pairs get label 1.
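In code, that convention is:

# Label convention implied by the loss above (same_class is a placeholder boolean):
#   label = 0 -> similar pair    -> the d^2 term is minimized
#   label = 1 -> dissimilar pair -> the max(margin - d, 0)^2 term is minimized
label = 0 if same_class else 1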
You could also try using a different model; in my experience GoogLeNet works well.
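Swapping it in could look like this (a sketch; the 256 ties in with the embedding size suggested above):

import torch.nn as nn
from torchvision import models

# Sketch: GoogLeNet backbone producing a 256-d embedding
base_model = models.googlenet(weights=models.GoogLeNet_Weights.DEFAULT)
base_model.aux_logits = False  # do not return the auxiliary classifier outputs in training mode
base_model.fc = nn.Linear(base_model.fc.in_features, 256)  # replace the 1000-class head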
Not everything is guaranteed to work; now it is time to find out why it doesn't. Make hypotheses and test them, do science.
A starting point may be the hypothesis that in the embedding space, the distance d should be small for images of the same class (similar) and large for dissimilar images. You could plot d_similar and d_dissimilar during training: as training proceeds, d_similar should get smaller, d_dissimilar larger, and d_similar should stay below d_dissimilar. Is that the case? What does the graph look like? How does it come about?
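A minimal sketch of such a check (assuming label 0 = similar and 1 = dissimilar, as in your loss, and a placeholder val_loader; record the two means once per epoch and plot them):

import torch
import torch.nn.functional as F

# Sketch: mean embedding distance for similar vs. dissimilar pairs
d_similar, d_dissimilar = [], []
model.eval()
with torch.no_grad():
    for img1, img2, label in val_loader:
        out1, out2 = model(img1.to(device), img2.to(device))
        d = F.pairwise_distance(out1, out2).cpu()
        mask = label.view(-1) == 0  # similar pairs
        d_similar.extend(d[mask].tolist())
        d_dissimilar.extend(d[~mask].tolist())
model.train()
print(sum(d_similar) / len(d_similar), sum(d_dissimilar) / len(d_dissimilar))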
You can also look at the two parts of the loss independently. Maybe (1 - Y) * d^2 gets small quickly during training while Y * max(margin - d, 0)^2 stays large, e.g. because everything gets projected onto the same point in the embedding space? One way to combat this could be to run inference on a larger number of pairs (e.g. 8 * batch_size) and select the batch_size pairs with the largest loss for training, as sketched below.
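A sketch of that selection step (hard-example mining; all names are placeholders, and the per-pair loss mirrors the ContrastiveLoss above):

import torch
import torch.nn.functional as F

def hardest_pairs(model, margin, img1, img2, label, batch_size):
    """Select the batch_size hardest pairs out of a larger candidate
    pool (e.g. 8 * batch_size pairs stacked in img1 / img2 / label)."""
    model.eval()
    with torch.no_grad():
        out1, out2 = model(img1, img2)
        d = F.pairwise_distance(out1, out2)
        y = label.view(-1).float()
        # per-pair contrastive loss: same formula as above, without the mean
        per_pair = (1 - y) * d.pow(2) + y * (margin - d).clamp(min=0).pow(2)
    idx = per_pair.topk(batch_size).indices  # indices of the largest losses
    model.train()
    return img1[idx], img2[idx], label[idx]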