I must implement this network:
It is similar to a siamese network with a contrastive loss. My problem is with S1/F1. The paper says the following:
"
F1 and S1 are neural networks that we use to learn the unit-normalized embeddings for the face and speech modalities, respectively. In Figure 1, we depict F1 and S1 in both training and testing routines. They are composed of 2D convolutional layers (purple), max-pooling layers (yellow), and fully connected layers (green). ReLU non-linearity is used between all layers. The last layer is a unit-normalization layer (blue). For both face and speech modalities, F1 and S1 return 250-dimensional unit-normalized embeddings".
My questions are: should the input to these networks be a tensor of shape (number of videos, number of frames, features)? And can I use F.normalize to implement the unit-normalization layer?
I will answer your two questions without going into too much detail:
If you're working with a CNN, your input most likely carries spatial information, i.e. it is a two-dimensional multi-channel tensor of shape (*, channels, height, width), not a flat feature vector of shape (*, features). You simply won't be able to apply a convolution (at least a 2D convolution) to your input if you don't retain the two spatial dimensions.
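For illustration, here is a minimal PyTorch sketch showing that a 2D convolution expects that 4D layout; the channel count and the 112x112 image size are just assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

# A 2D convolution expects a 4D input: (batch, channels, height, width).
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)

images = torch.randn(8, 3, 112, 112)   # e.g. 8 RGB face crops of 112x112 (hypothetical size)
out = conv(images)                      # works: output shape (8, 16, 110, 110)

flat = torch.randn(8, 250)              # a plain per-sample feature vector
# conv(flat) would raise a RuntimeError: the spatial dimensions are missing.
```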
The last layer is described as a "unit-normalization" layer. This is simply the operation of rescaling the vector so that its norm equals 1. You can do this by dividing the vector by its (L2) norm.
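In PyTorch, F.normalize does exactly that. Below is a minimal, hypothetical sketch of an F1-style encoder: the layer counts, channel widths, and input size are assumptions of mine, and only the overall structure (2D convolutions, max-pooling, fully connected layers, ReLU between layers, a final unit-normalization, 250-dimensional output) follows the paper's description:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FaceEncoder(nn.Module):
    """Sketch of an F1-style encoder (layer sizes are assumptions)."""
    def __init__(self, in_channels=3, embed_dim=250):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(),   # infers the flattened size at first call
            nn.Linear(512, embed_dim),
        )

    def forward(self, x):
        x = self.fc(self.features(x))
        # Unit-normalization layer: divide by the L2 norm so that ||x|| == 1.
        # F.normalize(x, p=2, dim=1) is equivalent to x / x.norm(dim=1, keepdim=True).
        return F.normalize(x, p=2, dim=1)

faces = torch.randn(8, 3, 112, 112)      # hypothetical batch of face crops
emb = FaceEncoder()(faces)               # shape (8, 250)
print(emb.norm(dim=1))                   # all values are ~1.0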