Tags: python, tensorflow, tensorflow1.15, horovod

Local Gradient Aggregation for Horovod using Tensorflow 1.X


I am trying to use Horovod for distributed training on GPUs across different servers, following the advice here.

I wanted to implement local gradient aggregation. In the explanation the modification looks easy: optimizer = hvd.DistributedOptimizer(opt, backward_passes_per_step=4).
However, trying to use it in my example model results in the following error.

tensorflow.python.framework.errors_impl.FailedPreconditionError: 2 root error(s) found.
[1,4]<stderr>:  (0) Failed precondition: Attempting to use uninitialized value aggregation_variables_4/aggregation_counter
[1,4]<stderr>:   [[node aggregation_variables_4/aggregation_counter/read

I am using native TensorFlow 1.15, not Keras or the latest TensorFlow version.

Is there a working example of this, or does someone know how to implement it?


Solution

  • I have solved the problem. As the error message indicates, the aggregation_counter variable was not initialized. I was only running sess.run(tf.global_variables_initializer()). Adding sess.run(tf.local_variables_initializer()) did the trick. I am not yet sure why the global variable initializer failed to initialize the aggregation_counter variable.
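To make the fix concrete, here is a minimal sketch of a TF 1.x Horovod setup with backward_passes_per_step and both initializers run before training. The model, placeholders, and learning rate are placeholder assumptions, not the asker's actual code; the likely explanation for the error is that Horovod registers its aggregation variables (such as aggregation_counter) in the LOCAL_VARIABLES collection, which tf.global_variables_initializer() does not cover.

```python
# Hedged sketch, not a complete training script: TF 1.x + Horovod with
# local gradient aggregation. Model details below are illustrative only.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Placeholder model (assumption; substitute your own graph).
x = tf.placeholder(tf.float32, shape=[None, 10])
y = tf.placeholder(tf.float32, shape=[None, 1])
pred = tf.layers.dense(x, 1)
loss = tf.losses.mean_squared_error(y, pred)

opt = tf.train.AdamOptimizer(0.001 * hvd.size())
# Accumulate gradients locally for 4 backward passes before averaging
# across workers; this is what creates the aggregation_counter variable.
opt = hvd.DistributedOptimizer(opt, backward_passes_per_step=4)
train_op = opt.minimize(loss)

config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

with tf.Session(config=config) as sess:
    # Both initializers are needed: the aggregation variables appear to
    # live in the LOCAL_VARIABLES collection, so global init alone fails
    # with the FailedPreconditionError shown in the question.
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())
    sess.run(hvd.broadcast_global_variables(0))
    # ... training loop calling sess.run(train_op, feed_dict=...) ...
```

Run it under horovodrun as usual, e.g. horovodrun -np 4 python train.py; the local aggregation means allreduce traffic occurs only every fourth step.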