I am trying to use Horovod for distributed training across GPUs on different servers, following the advice here.
I wanted to implement local gradient aggregation. In the explanation the modification looks easy: `optimizer = hvd.DistributedOptimizer(opt, backward_passes_per_step=4)`.
But trying to use it in my example model results in the following error:

```
tensorflow.python.framework.errors_impl.FailedPreconditionError: 2 root error(s) found.
[1,4]<stderr>: (0) Failed precondition: Attempting to use uninitialized value aggregation_variables_4/aggregation_counter
[1,4]<stderr>: [[node aggregation_variables_4/aggregation_counter/read
```
I am using native TensorFlow 1.15, not Keras or the latest TensorFlow version. Is there a working example of this, or does someone know how to implement it?
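For reference, here is roughly the setup that reproduces the error (a minimal sketch; the model, loss, and learning rate are placeholders, not my actual code):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Placeholder model and loss, just for illustration.
x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
loss = tf.reduce_mean(tf.square(tf.layers.dense(x, 1) - y))

opt = tf.train.AdamOptimizer(0.001 * hvd.size())
# Aggregate gradients locally over 4 backward passes before the allreduce.
opt = hvd.DistributedOptimizer(opt, backward_passes_per_step=4)
train_op = opt.minimize(loss)

with tf.Session() as sess:
    # Running only the global initializer here is what triggers the error.
    sess.run(tf.global_variables_initializer())
```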
I have solved the problem. As indicated in the error message, the `aggregation_counter` variable is not initialized. I was only running `sess.run(tf.global_variables_initializer())`. Adding `sess.run(tf.local_variables_initializer())` did the trick. I am not yet sure why the global variable initializer fails to initialize the `aggregation_counter` variable.
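For reference, the session setup now looks like this (a minimal sketch; my guess, which I have not verified against the Horovod source, is that the counter is registered as a local variable, which would explain why the global initializer misses it):

```python
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Also initialize local variables: the aggregation_counter created by
    # backward_passes_per_step is not covered by the global initializer.
    sess.run(tf.local_variables_initializer())
```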