
TensorFlow profiler outputs 2 FLOPs for a Conv2D instead of 1

Does anybody know why the TF profiler reports 2 FLOPs for a Conv2D operation instead of 1? In the example below, the input is a 1x1 image with 1 channel and the batch size is 1. The convolution has a single 1x1 filter and no bias, so I would expect exactly 1 multiplication. However, the TF profiler reports 2 FLOPs. Does the FLOP count include something other than the multiplication? Thanks.

Here is the example:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # assuming you have a gpu0
import tensorflow as tf
from tensorflow.keras import backend as K  # same Keras as the tf.keras model below

def load_pb(pb):
    with tf.gfile.GFile(pb, "rb") as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())  # read the serialized graph from disk
    with tf.Graph().as_default() as graph:
        tf.import_graph_def(graph_def, name='')
        return graph

def freeze_session(session, keep_var_names=None, output_names=None, clear_devices=True):
    """Return a GraphDef in which the session's variables are replaced by constants."""
    from tensorflow.python.framework.graph_util import convert_variables_to_constants
    graph = session.graph
    with graph.as_default():
        # variables to freeze: every global variable except those in keep_var_names
        freeze_var_names = list(set(v.op.name for v in tf.global_variables()).difference(keep_var_names or []))
        output_names = output_names or []
        output_names += [v.op.name for v in tf.global_variables()]
        input_graph_def = graph.as_graph_def()
        if clear_devices:
            # strip device placements so the frozen graph is not tied to one device
            for node in input_graph_def.node:
                node.device = ""
        frozen_graph = convert_variables_to_constants(session, input_graph_def, output_names, freeze_var_names)
        return frozen_graph

# define the model
inp = tf.keras.layers.Input(batch_shape=(1, 1, 1, 1), name='input')
x = tf.keras.layers.Conv2D(1, kernel_size=(1, 1), strides=(1, 1), padding='same', name='conv', use_bias=False)(inp)
out = tf.keras.layers.Flatten(name='output')(x)
model = tf.keras.models.Model(inputs=inp, outputs=out)
model.summary()  # prints the layer table shown in the output below

# freeze the model
output_graph_def = freeze_session(K.get_session(), output_names=[out.op.name for out in model.outputs])
with tf.gfile.GFile('graph.pb', "wb") as f:
    f.write(output_graph_def.SerializeToString())

# load the protobuf and perform tf profiling
g2 = load_pb('./graph.pb')
with g2.as_default():
    opts = tf.profiler.ProfileOptionBuilder.float_operation()
    flops = tf.profiler.profile(g2, run_meta=tf.RunMetadata(), cmd='scope', options=opts)
    print('FLOP', flops.total_float_ops)

The output is:

Layer (type)                 Output Shape              Param #
input (InputLayer)           (1, 1, 1, 1)              0
conv (Conv2D)                (1, 1, 1, 1)              1
output (Flatten)             (1, 1)                    0
Total params: 1
Trainable params: 1
Non-trainable params: 0
Converted 1 variables to const ops.
Parsing Inputs...
-max_depth                  10000
-min_bytes                  0
-min_peak_bytes             0
-min_residual_bytes         0
-min_output_bytes           0
-min_micros                 0
-min_accelerator_micros     0
-min_cpu_micros             0
-min_params                 0
-min_float_ops              1
-min_occurrence             0
-step                       -1
-order_by                   float_ops
-account_type_regexes       .*
-start_name_regexes         .*
-show_name_regexes          .*
-account_displayed_op_only  true
-select                     float_ops
-output                     stdout:
==================Model Analysis Report======================
scope: The nodes in the model graph are organized by their names, which is hierarchical like filesystem.
flops: Number of float operations. Note: Please read the implementation for the math behind it.
node name | # float_ops
_TFProfRoot (--/2 flops)
  conv/Conv2D (2/2 flops)
======================End of Report==========================


  • Consider almost the same setup, but with n input channels to the convolution. You would then have n multiplications, whose results are summed. One could initialize the sum with the result of the first multiplication and then add the remaining (n-1) products, for 2n-1 operations in total. But that treats the first multiplication as a special case. It is more uniform to initialize the sum to 0 and then accumulate all n products, which costs n multiplications and n additions. In particular, for n=1 you get the degenerate case where

    sum = 0
    mult = w1 * a1
    sum = sum + mult

    which results in 2 FLOPs, or 1 MAC (multiply-accumulate) operation.
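
The counting convention described above can be sketched in plain Python. This is only an illustration, not TensorFlow's own code: the function names `dot_with_op_count` and `conv2d_flops` are made up for this example, and the closed form assumes that every multiply-accumulate is counted as 2 FLOPs, which is consistent with the 2 FLOPs reported for the 1x1x1 case in the question.

```python
def dot_with_op_count(weights, activations):
    """Accumulate sum(w * a) starting from 0, counting each multiply and add."""
    total = 0.0
    flops = 0
    for w, a in zip(weights, activations):
        product = w * a          # 1 multiplication
        total = total + product  # 1 addition, even for the very first product
        flops += 2
    return total, flops


def conv2d_flops(batch, out_h, out_w, out_c, k_h, k_w, in_c):
    """Closed-form count under the same assumption: 2 FLOPs per MAC."""
    macs = batch * out_h * out_w * out_c * k_h * k_w * in_c
    return 2 * macs


# The 1x1x1 case from the question: one MAC, two FLOPs.
value, flops = dot_with_op_count([0.5], [3.0])
print(value, flops)                        # 1.5 2
print(conv2d_flops(1, 1, 1, 1, 1, 1, 1))   # 2
```

Under this convention a convolution with n input channels and a 1x1 kernel costs 2n FLOPs per output element rather than 2n-1, because the accumulator starts at 0 instead of at the first product.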