I am trying to understand how the dimensions in convolutional neural network behave. In the figure below the input is 28-by-28 matrix with 1 channel. Then there are 32 5-by-5 filters (with stride 2 in height and width). So I understand that the result is 14-by-14-by-32. But then in the next convolutional layer we have 64 5-by-5 filters (again with stride 2). So why the result is 7-by-7- by 64 and not 7-by-7-by 32*64? Aren't we applying each one of the 64 filters to each one of the 32 channels?
One filter is the sum of all the dimensions in the previous layer. This means that the 5x5 filter sums up over all 32 dimensions and in essence is a weighted sum of 32*5*5 values. However the weight values are shared across dimensions. Then there are 64 such filters. A better explanation with images can be found here: http://cs231n.github.io/convolutional-networks/.