I've been trying to compare the model summary of the Keras InceptionResnetV2 implementation with the architecture specified in the paper, and they don't seem to bear much resemblance when it comes to the filter_concat blocks.
The first lines of the model summary()
are shown below. (In my case the input is changed to 512x512, but to my knowledge this doesn't affect the number of filters per layer, so we can still use it to follow the paper-to-code translation.)
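For reference, this is roughly how the summary below can be produced; a minimal sketch, assuming the keras.applications module and include_top=False so that a 512x512 input is accepted:
# Minimal sketch of how the summary below can be reproduced
# (keras.applications; include_top=False so a 512x512 input is allowed).
from keras.applications.inception_resnet_v2 import InceptionResNetV2

model = InceptionResNetV2(include_top=False,
                          weights=None,   # or 'imagenet'
                          input_shape=(512, 512, 3))
model.summary()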
Model: "inception_resnet_v2"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) (None, 512, 512, 3) 0
__________________________________________________________________________________________________
conv2d_1 (Conv2D) (None, 255, 255, 32) 864 input_1[0][0]
__________________________________________________________________________________________________
batch_normalization_1 (BatchNor (None, 255, 255, 32) 96 conv2d_1[0][0]
__________________________________________________________________________________________________
activation_1 (Activation) (None, 255, 255, 32) 0 batch_normalization_1[0][0]
__________________________________________________________________________________________________
conv2d_2 (Conv2D) (None, 253, 253, 32) 9216 activation_1[0][0]
__________________________________________________________________________________________________
batch_normalization_2 (BatchNor (None, 253, 253, 32) 96 conv2d_2[0][0]
__________________________________________________________________________________________________
activation_2 (Activation) (None, 253, 253, 32) 0 batch_normalization_2[0][0]
__________________________________________________________________________________________________
conv2d_3 (Conv2D) (None, 253, 253, 64) 18432 activation_2[0][0]
__________________________________________________________________________________________________
batch_normalization_3 (BatchNor (None, 253, 253, 64) 192 conv2d_3[0][0]
__________________________________________________________________________________________________
activation_3 (Activation) (None, 253, 253, 64) 0 batch_normalization_3[0][0]
__________________________________________________________________________________________________
max_pooling2d_1 (MaxPooling2D) (None, 126, 126, 64) 0 activation_3[0][0]
__________________________________________________________________________________________________
conv2d_4 (Conv2D) (None, 126, 126, 80) 5120 max_pooling2d_1[0][0]
__________________________________________________________________________________________________
batch_normalization_4 (BatchNor (None, 126, 126, 80) 240 conv2d_4[0][0]
__________________________________________________________________________________________________
activation_4 (Activation) (None, 126, 126, 80) 0 batch_normalization_4[0][0]
__________________________________________________________________________________________________
conv2d_5 (Conv2D) (None, 124, 124, 192 138240 activation_4[0][0]
__________________________________________________________________________________________________
batch_normalization_5 (BatchNor (None, 124, 124, 192 576 conv2d_5[0][0]
__________________________________________________________________________________________________
activation_5 (Activation) (None, 124, 124, 192 0 batch_normalization_5[0][0]
__________________________________________________________________________________________________
max_pooling2d_2 (MaxPooling2D) (None, 61, 61, 192) 0 activation_5[0][0]
__________________________________________________________________________________________________
conv2d_9 (Conv2D) (None, 61, 61, 64) 12288 max_pooling2d_2[0][0]
__________________________________________________________________________________________________
batch_normalization_9 (BatchNor (None, 61, 61, 64) 192 conv2d_9[0][0]
__________________________________________________________________________________________________
activation_9 (Activation) (None, 61, 61, 64) 0 batch_normalization_9[0][0]
__________________________________________________________________________________________________
conv2d_7 (Conv2D) (None, 61, 61, 48) 9216 max_pooling2d_2[0][0]
__________________________________________________________________________________________________
conv2d_10 (Conv2D) (None, 61, 61, 96) 55296 activation_9[0][0]
__________________________________________________________________________________________________
batch_normalization_7 (BatchNor (None, 61, 61, 48) 144 conv2d_7[0][0]
__________________________________________________________________________________________________
batch_normalization_10 (BatchNo (None, 61, 61, 96) 288 conv2d_10[0][0]
__________________________________________________________________________________________________
activation_7 (Activation) (None, 61, 61, 48) 0 batch_normalization_7[0][0]
__________________________________________________________________________________________________
activation_10 (Activation) (None, 61, 61, 96) 0 batch_normalization_10[0][0]
__________________________________________________________________________________________________
average_pooling2d_1 (AveragePoo (None, 61, 61, 192) 0 max_pooling2d_2[0][0]
__________________________________________________________________________________________________
.
.
.
many more lines
Figure 3 of the paper (appended below) shows how the STEM block is formed for both InceptionV4 and InceptionResnetV2. In Figure 3 there are three filter concatenations in the STEM block, but in the output shown above there is no concatenation at all: the first concatenation should appear right after max_pooling2d_1
, yet all I see is a purely sequential chain of convolutions and max poolings. The number of filters increases the way a concatenation would make it increase, but no concatenation is actually performed; the filters just seem to be stacked sequentially. Does anyone have a clue what's going on in this output? Does it behave the same as the architecture described in the paper?
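For reference, this is what I would expect a filter_concat to look like in the Keras functional API; a minimal sketch of the first concatenation of the Figure-3 STEM, with names and exact arguments chosen by me rather than taken from the Keras source:
# Sketch of the first filter_concat of the Figure-3 stem (illustrative only).
from keras.layers import Input, Conv2D, MaxPooling2D, Concatenate

x = Input(shape=(None, None, 64))                          # output of the 3x3/64 conv
branch_pool = MaxPooling2D(3, strides=2, padding='valid')(x)
branch_conv = Conv2D(96, 3, strides=2, padding='valid', use_bias=False)(x)
merged = Concatenate(axis=-1)([branch_pool, branch_conv])  # 64 + 96 = 160 filters
A layer like that shows up in summary() as a Concatenate entry with two parents in the "Connected to" column, which is exactly what is missing from the InceptionResnetV2 output above.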
For comparison, I've found an InceptionV4 Keras implementation, and it does perform a filter_concat (concatenate_1
) for the first concatenation of the STEM block. Here are the first lines of its summary()
.
Model: "inception_v4"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) (None, 512, 512, 3) 0
__________________________________________________________________________________________________
conv2d_1 (Conv2D) (None, 255, 255, 32) 864 input_1[0][0]
__________________________________________________________________________________________________
batch_normalization_1 (BatchNor (None, 255, 255, 32) 96 conv2d_1[0][0]
__________________________________________________________________________________________________
activation_1 (Activation) (None, 255, 255, 32) 0 batch_normalization_1[0][0]
__________________________________________________________________________________________________
conv2d_2 (Conv2D) (None, 253, 253, 32) 9216 activation_1[0][0]
__________________________________________________________________________________________________
batch_normalization_2 (BatchNor (None, 253, 253, 32) 96 conv2d_2[0][0]
__________________________________________________________________________________________________
activation_2 (Activation) (None, 253, 253, 32) 0 batch_normalization_2[0][0]
__________________________________________________________________________________________________
conv2d_3 (Conv2D) (None, 253, 253, 64) 18432 activation_2[0][0]
__________________________________________________________________________________________________
batch_normalization_3 (BatchNor (None, 253, 253, 64) 192 conv2d_3[0][0]
__________________________________________________________________________________________________
activation_3 (Activation) (None, 253, 253, 64) 0 batch_normalization_3[0][0]
__________________________________________________________________________________________________
conv2d_4 (Conv2D) (None, 126, 126, 96) 55296 activation_3[0][0]
__________________________________________________________________________________________________
batch_normalization_4 (BatchNor (None, 126, 126, 96) 288 conv2d_4[0][0]
__________________________________________________________________________________________________
max_pooling2d_1 (MaxPooling2D) (None, 126, 126, 64) 0 activation_3[0][0]
__________________________________________________________________________________________________
activation_4 (Activation) (None, 126, 126, 96) 0 batch_normalization_4[0][0]
__________________________________________________________________________________________________
concatenate_1 (Concatenate) (None, 126, 126, 160 0 max_pooling2d_1[0][0]
activation_4[0][0]
__________________________________________________________________________________________________
conv2d_7 (Conv2D) (None, 126, 126, 64) 10240 concatenate_1[0][0]
__________________________________________________________________________________________________
batch_normalization_7 (BatchNor (None, 126, 126, 64) 192 conv2d_7[0][0]
__________________________________________________________________________________________________
activation_7 (Activation) (None, 126, 126, 64) 0 batch_normalization_7[0][0]
__________________________________________________________________________________________________
conv2d_8 (Conv2D) (None, 126, 126, 64) 28672 activation_7[0][0]
__________________________________________________________________________________________________
.
.
.
and many more lines
So, according to the paper, both architectures should have identical first layers. Or am I missing something?
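One way to confirm the difference programmatically (a sketch; irv2 and iv4 stand for the two models whose summaries are shown above):
from keras.layers import Concatenate

def first_concat(model):
    # Return the index and name of the first Concatenate layer, if any.
    for i, layer in enumerate(model.layers):
        if isinstance(layer, Concatenate):
            return i, layer.name
    return None

print(first_concat(irv2))  # Keras InceptionResNetV2: first concat appears only after the stem
print(first_concat(iv4))   # InceptionV4 implementation: concat appears inside the stem (Figure 3)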
EDIT: I've found that the Keras implementation of InceptionResnetV2 does not follow the STEM block of InceptionResnetV2, but rather the one of InceptionResnetV1 (Figure 14 of the paper, appended below). After the STEM block, it does seem to follow the rest of the InceptionResnetV2 blocks nicely.
InceptionResnetV1 doesn't perform as well as InceptionResnetV2 (Figure 25), so I'm sceptical about using a block from V1 instead of the full V2 from Keras. I'll try to chop the STEM from the InceptionV4 implementation I've found and attach the rest of InceptionResnetV2 to it (sketch below).
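This is roughly the stem I plan to build; a sketch following Figure 3 as I read it, where conv_bn is my own helper and not something imported from keras.applications:
# Sketch of the Figure-3 (InceptionV4 / InceptionResnetV2) stem.
from keras.layers import (Input, Conv2D, BatchNormalization, Activation,
                          MaxPooling2D, Concatenate)

def conv_bn(x, filters, kernel, strides=1, padding='valid'):
    x = Conv2D(filters, kernel, strides=strides, padding=padding,
               use_bias=False)(x)
    x = BatchNormalization(scale=False)(x)
    return Activation('relu')(x)

def stem_v4(img_input):
    x = conv_bn(img_input, 32, 3, strides=2)   # 3x3/2 V
    x = conv_bn(x, 32, 3)                      # 3x3 V
    x = conv_bn(x, 64, 3, padding='same')      # 3x3

    # first filter_concat
    a = MaxPooling2D(3, strides=2)(x)
    b = conv_bn(x, 96, 3, strides=2)
    x = Concatenate()([a, b])

    # second filter_concat
    a = conv_bn(x, 64, 1, padding='same')
    a = conv_bn(a, 96, 3)
    b = conv_bn(x, 64, 1, padding='same')
    b = conv_bn(b, 64, (7, 1), padding='same')
    b = conv_bn(b, 64, (1, 7), padding='same')
    b = conv_bn(b, 96, 3)
    x = Concatenate()([a, b])

    # third filter_concat
    a = conv_bn(x, 192, 3, strides=2)
    b = MaxPooling2D(3, strides=2)(x)
    return Concatenate()([a, b])

stem_out = stem_v4(Input(shape=(512, 512, 3)))
# ...then attach the Inception-ResNet-A/B/C blocks of InceptionResnetV2 here.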
The same question was closed without explanation in the tf-models GitHub repository. I leave it here in case someone is interested: https://github.com/tensorflow/models/issues/1235
EDIT 2: For some reason, Google AI (the creators of the Inception architecture) show an image of "inception-resnet-v2" in the blog post announcing the code release, but its STEM block is the one from InceptionV3, not the one from InceptionV4 as specified in the paper. So either the paper is wrong, or the code deviates from the paper for some reason.
Either way, it achieves similar results.
I just received an e-mail from Alex Alemi, Senior Research Scientist at Google and author of the blog post announcing the InceptionResnetV2 code release, confirming the discrepancy. It seems that during internal experiments the STEM blocks were switched, and the release simply stayed that way.
Quote:
[...] Not entirely sure what happened but the code is obviously the source of truth in the sense that the released checkpoint is for the code that is also released. When we were developing the architecture we did a whole slew of internal experiments and I imagine at some point the stems were switched. Not sure I have the time to dig deeper at the moment, but like I said, the released checkpoint is a checkpoint for the released code as you can verify yourself by running the evaluation pipeline. I agree with you that it seems like this is using the original Inception V1 stem. Best Regards,
Alex Alemi
I'll update this post with changes regarding this subject.
UPDATE: Christian Szegedy, an author of the original paper, just agreed with the previous e-mail:
The original experiments and model was created in DistBelief, a completely different framework pre-dating Tensorflow.
The TF version was added a year later and might have had discrepancies from the original model, however it was made sure to achieve similar results.
So, since it achieves similar results, your experiments should turn out roughly the same.