I am working on inference with a PyTorch ONNX model, which is why I am asking this question.
Assume I have an image with dimensions 32 x 32 x 3 (CIFAR-10 dataset). I pass it through a Conv2d with dimensions 3 x 192 x 5 x 5. The command I used is: Conv2d(3, 192, kernel_size=5, stride=1, padding=2).
Using the formula (stated on pg. 12 of https://arxiv.org/pdf/1603.07285.pdf for reference), I should be getting an output image with dimensions 28 x 28 x 192 (input - kernel + 1 = 32 - 5 + 1 = 28).
The question is: how has PyTorch implemented this 4D tensor 3 x 192 x 5 x 5 to get me an output of 28 x 28 x 192? The layer is a 4D tensor while the input image is a 2D one. How is the kernel (5 x 5) spread over the image matrix 32 x 32 x 3? What does the kernel convolve with first: 3 x 192 or 32 x 32?
Note: I have understood the 2D case; I am asking the above questions about 3 (or more) dimensions.
The input to Conv2d is a tensor of shape (N, C_in, H_in, W_in) and the output is of shape (N, C_out, H_out, W_out), where N is the batch size (number of images), C is the number of channels, H is the height and W is the width. The output height and width H_out, W_out are computed as follows (ignoring dilation):
H_out = (H_in + 2*padding[0] - kernel_size[0]) / stride[0] + 1
W_out = (W_in + 2*padding[1] - kernel_size[1]) / stride[1] + 1
See cs231n for an explanation of how these formulas were obtained.
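The same formula can be written as a small Python helper for a quick sanity check (the function name conv2d_output_size is mine, not part of PyTorch):

```python
def conv2d_output_size(in_size, kernel_size, stride=1, padding=0):
    # Output height/width along one spatial dimension (dilation ignored).
    return (in_size + 2 * padding - kernel_size) // stride + 1

print(conv2d_output_size(32, 5))             # 28
print(conv2d_output_size(32, 5, padding=2))  # 32
```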
In your example N = 1, H_in = 32, W_in = 32, C_in = 3, kernel_size = (5, 5), strides = (1, 1), padding = (0, 0), giving H_out = 28, W_out = 28. (Note that the padding=2 in your Conv2d call would instead give H_out = W_out = 32; the 28 x 28 you expect corresponds to padding = 0.)
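You can verify the shapes directly with a dummy CIFAR-10-sized batch (a minimal sketch; the variable names are mine):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)  # dummy CIFAR-10 image: (N, C_in, H_in, W_in)

conv = nn.Conv2d(3, 192, kernel_size=5, stride=1, padding=0)
print(conv(x).shape)           # torch.Size([1, 192, 28, 28])

# With padding=2, as in the command from the question, the spatial size is preserved:
conv_pad = nn.Conv2d(3, 192, kernel_size=5, stride=1, padding=2)
print(conv_pad(x).shape)       # torch.Size([1, 192, 32, 32])
```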
The C_out = 192 means that there are 192 different filters, each of shape (C_in, kernel_size[0], kernel_size[1]) = (3, 5, 5). Each filter independently performs convolution with the input image, resulting in a 2D tensor of shape (H_out, W_out) = (28, 28), and since there are C_out = 192 filters and N = 1 image, the final output is of shape (N, C_out, H_out, W_out) = (1, 192, 28, 28).
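To make it concrete that each filter spans all three input channels at once, here is a deliberately naive sketch with explicit loops, checked against F.conv2d. This is only meant to illustrate the shapes, not how PyTorch actually computes the result internally:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 32, 32)      # (N, C_in, H_in, W_in)
w = torch.randn(192, 3, 5, 5)      # (C_out, C_in, kH, kW): 192 filters of shape (3, 5, 5)

out = torch.zeros(1, 192, 28, 28)  # (N, C_out, H_out, W_out), stride 1, no padding
for co in range(192):              # one 2D output map per filter
    for i in range(28):
        for j in range(28):
            # each filter sees a (3, 5, 5) patch: all input channels at once
            patch = x[0, :, i:i+5, j:j+5]
            out[0, co, i, j] = (patch * w[co]).sum()

print(torch.allclose(out, F.conv2d(x, w), atol=1e-4))  # True
```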
To understand exactly how the convolution is performed, see the convolution demo.