Here's an example to clarify what I mean:
First session.run(): [image: first run of a TensorFlow session]
Later session.run(): [image: later runs of a TensorFlow session]
I understand TensorFlow is doing some initialization here, but I'd like to know where in the source this happens. It occurs on CPU as well as GPU, but the effect is more prominent on GPU. For example, with an explicit Conv2D operation, the first run has far more Conv2D operations in the GPU stream. In fact, if I change the input size of the Conv2D, the count can go from tens to hundreds of stream Conv2D operations. In later runs, however, there are always only five Conv2D operations in the GPU stream, regardless of input size. When running on CPU, the operation list is the same in the first run as in later runs, but the same time discrepancy appears.
What portion of the TensorFlow source is responsible for this behavior? Where are GPU operations "split"?
Thanks for the help!
The tf.nn.conv2d() op takes much longer to run on the first tf.Session.run() invocation because, by default, TensorFlow uses cuDNN's autotune facility to choose how to run subsequent convolutions as fast as possible. You can see the autotune invocation here.
There is an undocumented environment variable that you can use to disable autotune. Set TF_CUDNN_USE_AUTOTUNE=0 when you start the process running TensorFlow (e.g. the python interpreter) to disable its use.
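As a rough sketch, the variable can also be set from inside the Python process, assuming this happens before TensorFlow is imported (launching with it already set in the shell, as described above, is the safer route). The time_runs helper below is a hypothetical name, not part of TensorFlow; it only measures wall-clock time for repeated calls so the first run can be compared with later ones.

```python
import os
import time

# Assumption: the variable has to be in the environment before TensorFlow
# creates its convolution kernels, so set it before importing tensorflow.
os.environ["TF_CUDNN_USE_AUTOTUNE"] = "0"


def time_runs(run_once, n_runs=5):
    """Return the wall-clock duration (seconds) of each of n_runs calls.

    time_runs is a hypothetical helper for illustration; run_once is any
    zero-argument callable, e.g. lambda: sess.run(conv_op).
    """
    durations = []
    for _ in range(n_runs):
        start = time.perf_counter()
        run_once()
        durations.append(time.perf_counter() - start)
    return durations
```

With a session and a convolution op already built, something like time_runs(lambda: sess.run(conv_op)) would give one duration per run. Comparing the first entry with the rest, with and without the variable set, should show how much of the first-run cost comes from autotune.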