I am trying to understand how the MNIST example in MatConvNet is designed. It looks like they are using a variation of LeNet, but since I have not used MatConvNet before, I am having difficulty understanding how the connection between the last convolutional layer and the first fully connected layer is established:
f = 1/100 ;  % weight-initialization scale (1/100 in the original example)
net.layers = {} ;
net.layers{end+1} = struct('type', 'conv', ...            % 28x28x1 -> 24x24x20
                           'weights', {{f*randn(5,5,1,20, 'single'), zeros(1, 20, 'single')}}, ...
                           'stride', 1, ...
                           'pad', 0) ;
net.layers{end+1} = struct('type', 'pool', ...            % 24x24x20 -> 12x12x20
                           'method', 'max', ...
                           'pool', [2 2], ...
                           'stride', 2, ...
                           'pad', 0) ;
net.layers{end+1} = struct('type', 'conv', ...            % 12x12x20 -> 8x8x50
                           'weights', {{f*randn(5,5,20,50, 'single'), zeros(1,50,'single')}}, ...
                           'stride', 1, ...
                           'pad', 0) ;
net.layers{end+1} = struct('type', 'pool', ...            % 8x8x50 -> 4x4x50
                           'method', 'max', ...
                           'pool', [2 2], ...
                           'stride', 2, ...
                           'pad', 0) ;
net.layers{end+1} = struct('type', 'conv', ...            % 4x4x50 -> 1x1x500 ("FC")
                           'weights', {{f*randn(4,4,50,500, 'single'), zeros(1,500,'single')}}, ...
                           'stride', 1, ...
                           'pad', 0) ;
net.layers{end+1} = struct('type', 'relu') ;
net.layers{end+1} = struct('type', 'conv', ...            % 1x1x500 -> 1x1x10 (classifier)
                           'weights', {{f*randn(1,1,500,10, 'single'), zeros(1,10,'single')}}, ...
                           'stride', 1, ...
                           'pad', 0) ;
net.layers{end+1} = struct('type', 'softmaxloss') ;
Usually, in libraries like TensorFlow and MXNet, the output of the last convolutional layer is flattened and then connected to the fully connected one. Here, as far as I understand, they treat the layer with the weights {{f*randn(4,4,50,500, 'single'), zeros(1,500,'single')}} as the first fully connected layer, yet this layer still produces a three-dimensional activation map as its result. I don't see where the "flattening" happens. I need help understanding how the connection between the convolutional layer and the fully connected layer is established here.
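To illustrate the question, here is a rough sketch of how one could inspect the intermediate shapes (assuming MatConvNet is compiled and on the path; netTest and im are just illustrative names, and the final loss layer is dropped because vl_simplenn's softmaxloss would need labels):
netTest = net ;
netTest.layers(end) = [] ;          % softmaxloss needs class labels, so drop it
im = randn(28, 28, 1, 'single') ;   % a dummy MNIST-sized input
res = vl_simplenn(netTest, im) ;    % forward pass; res(i).x is layer i-1's output
for i = 1:numel(res)
  fprintf('res(%d).x: %s\n', i, mat2str(size(res(i).x))) ;
end
This shows the layer with 4x4x50x500 weights receiving a 4x4x50 map, but I still don't see an explicit flatten anywhere.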
As far as I know, you simply substitute the fully connected layer with a convolutional layer whose filters have width and height equal to the width and height of its input. In fact, you don't need to flatten the data before a fully connected layer in MatConvNet (flattened data has shape 1x1xDxN). In your case, using a kernel with the same spatial size as its input, i.e. 4x4, makes the layer operate as a fully connected layer, and its output has shape 1 x 1 x 500 x B, where B is the fourth (batch) dimension of the input.
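To make this concrete, here is a minimal sketch (the variable names x, w, b, y are just for illustration) showing that a convolution whose kernel covers the entire input computes exactly the same numbers as a matrix multiplication on the flattened input:
x = randn(4, 4, 50, 1, 'single') ;    % one 4x4x50 activation map (B = 1)
w = randn(4, 4, 50, 500, 'single') ;  % each filter covers the whole input
b = zeros(1, 500, 'single') ;
y = vl_nnconv(x, w, b, 'stride', 1, 'pad', 0) ;
size(y)                               % 1 1 500, i.e. a 1x1x500 map per image

% The same result via an explicit flatten + matrix multiply:
yFlat = reshape(w, [], 500)' * reshape(x, [], 1) + b(:) ;
max(abs(y(:) - yFlat(:)))             % ~0, up to floating-point error
So the "flattening" is implicit: once the kernel spans the whole spatial extent, each output channel is a dot product over all input activations, which is precisely a fully connected unit.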
Update: the network architecture and the size of each layer's output are visualized here to help follow the flow of operations.