python · tensorflow · tensorflow-hub · tensorflow-slim

Reproduce Tensorflow Hub module output with Tensorflow Slim


I am trying to reproduce the output from a Tensorflow Hub module that is based on a Tensorflow Slim checkpoint, using the Tensorflow Slim modules. However, I can't seem to get the expected output. For example, let us load the required libraries, create a sample input and the placeholder to feed the data:

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow.contrib.slim as slim
from tensorflow.contrib.slim import nets

images = np.random.rand(1, 224, 224, 3).astype(np.float32)
inputs = tf.placeholder(shape=[None, 224, 224, 3], dtype=tf.float32)

Load the TF Hub module:

resnet_hub = hub.Module("https://tfhub.dev/google/imagenet/resnet_v2_152/feature_vector/3")
features_hub = resnet_hub(inputs, signature="image_feature_vector", as_dict=True)["resnet_v2_152/block4"]

Now, let's do the same with TF Slim and create a loader that will load the checkpoint:

with slim.arg_scope(nets.resnet_utils.resnet_arg_scope()):
    _, end_points = nets.resnet_v2.resnet_v2_152(inputs, is_training=False)
    features_slim = end_points["resnet_v2_152/block4"]
loader = tf.train.Saver(tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="resnet_v2_152"))

Now, once we have everything in place we can test whether the outputs are the same:

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    loader.restore(sess, "resnet_v2_152_2017_04_14/resnet_v2_152.ckpt")
    slim_output = sess.run(features_slim, feed_dict={inputs: images})
    hub_output = sess.run(features_hub, feed_dict={inputs: images})
    np.testing.assert_array_equal(slim_output, hub_output)

However, the assertion fails because the two outputs differ. I assume this is because the TF Hub module applies some internal preprocessing to the inputs that the TF Slim implementation lacks.

Let me know what you think!


Solution

  • Those Hub modules scale their inputs from the canonical range [0,1] to whatever the respective slim checkpoint expects from the preprocessing it was trained with (typically [-1,+1] for "Inception-style" preprocessing). Feeding both graphs the same raw inputs can therefore explain a large difference; see the sketch below. Even after linear rescaling to fix that, a difference up to compounded numerical error wouldn't surprise me (given the many degrees of freedom inside TF), but major discrepancies might indicate a bug.
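
As a minimal sketch, reusing `inputs`, `images`, and `features_hub` from the question and assuming the checkpoint really does expect Inception-style [-1,+1] inputs (the rescaling and the tolerances below are assumptions, not verified values), you could rescale only the slim branch and compare with a tolerance instead of exact equality:

# Sketch only: the Hub module takes the canonical [0, 1] range, so keep feeding
# it `inputs` as-is; rescale to [-1, +1] for the slim graph (assumed
# Inception-style preprocessing).
scaled_inputs = inputs * 2.0 - 1.0  # [0, 1] -> [-1, +1] for the slim branch

with slim.arg_scope(nets.resnet_utils.resnet_arg_scope()):
    _, end_points = nets.resnet_v2.resnet_v2_152(scaled_inputs, is_training=False)
    features_slim = end_points["resnet_v2_152/block4"]
loader = tf.train.Saver(tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="resnet_v2_152"))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    loader.restore(sess, "resnet_v2_152_2017_04_14/resnet_v2_152.ckpt")
    slim_output, hub_output = sess.run([features_slim, features_hub],
                                       feed_dict={inputs: images})
    # Expect small floating-point differences rather than bit-exact equality.
    np.testing.assert_allclose(slim_output, hub_output, rtol=1e-4, atol=1e-4)

If the outputs still differ by far more than the tolerance after the rescaling, that would point to something else (e.g. comparing different endpoints), rather than just preprocessing.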