link: https://www.kaggle.com/c/diabetic-retinopathy-detection/discussion/15617
Github: https://github.com/sveitser/kaggle_diabetic
Hello, I am new to CNNs and recently I have been studying this solution. The author drew a table of the networks his group designed. The units, filter, and stride columns all make sense to me, but I just don't know what "size" means. Is it more likely to mean the batch size or the image size?
At first I thought it was the image size, but there are two reasons it should not be:
As described in their report, they simply cropped the original images to 128x128, 256x256, and 512x512 pixels and did no other image preprocessing.
After reading their code (from the GitHub link), I found that their setting for the InputLayer is:
(InputLayer, {'shape': (None, 3, cnf['w'], cnf['h'])}),
which confirms the description in their competition report.
Therefore, I think the input size should be 3x128x128, instead of 448.
Here are my questions:
1. If the input image size is not 448, what does 448 mean?
2. If it means the batch size, why would they choose 448?
3. Why would they let the batch size decrease (roughly halving each time) through 224, 111, 56, 27, 13, 6, 2 from the 1st layer to the 19th layer?
The "size" column of the linked table refers to the vertical and horizontal dimensions of the activations in a layer.
These are the full configurations for the networks from the table in the repo:
Both of these have input width and height 448, i.e. the size of the input layer is 448.
We can use the following formula to compute the vertical and horizontal dimensions of the activations of a convolutional layer:
ACTIVATION_SIZE = (INPUT_SIZE − FILTER_SIZE + PADDING_PREV + PADDING_AFTER) / STRIDE + 1
We can get the input size, filter size, and stride parameters from the network configs linked above. Since they use an early development version of Lasagne, it's hard to discern exactly what kind of padding they are using, so we will have to make some assumptions there.
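For convenience, here is a minimal Python sketch of that formula (the function name and signature are mine, not something from their repo):

def conv_output_size(input_size, filter_size, stride, pad_before=0, pad_after=0):
    # Spatial (vertical or horizontal) size of a conv layer's activations:
    # (INPUT_SIZE - FILTER_SIZE + PADDING_PREV + PADDING_AFTER) / STRIDE + 1
    return (input_size - filter_size + pad_before + pad_after) // stride + 1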
For Network A:
INPUT_SIZE = 448
FILTER_SIZE = 5
STRIDE = 2
Using the formula above, this resolves to an activation size of 224 if PADDING_PREV = 2 and PADDING_AFTER = 1 (or the other way around). Since the size of the first convolutional layer is 224 according to their table, we can be fairly sure that we interpreted the parameters correctly.
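Plugging these numbers into the hypothetical helper above confirms this:

conv_output_size(448, 5, 2, pad_before=2, pad_after=1)  # (448 - 5 + 2 + 1) / 2 + 1 = 224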
For Network B:
INPUT_SIZE = 448
FILTER_SIZE = 4
STRIDE = 2
This will result in an activation size of 224 as in the table if both paddings are 1.
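With the same hypothetical helper:

conv_output_size(448, 4, 2, pad_before=1, pad_after=1)  # (448 - 4 + 1 + 1) / 2 + 1 = 224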
In conclusion, the authors reported the architectures of their networks for 512x512 images and omitted the details of resizing those images to 448x448 and of how padding is applied. This is customary in the computer vision community, and one can always rely on the formula above to verify such details.