Using PyTorch and TensorFlow (TF), I was wondering, out of curiosity, how the Adam optimizer is implemented. I may be wrong, but it seems to me that the two implementations differ, and that the PyTorch one is the original algorithm from https://arxiv.org/pdf/1412.6980.pdf.
My problem comes from the eps parameter. The TF implementation seems to lead to a time- and $b_2$-dependence of this parameter, namely
$$q_{t+1} = q_t - \gamma\,\frac{\sqrt{1-b_2^t}}{1-b_1^t}\,\frac{m_t}{\sqrt{v_t}+\epsilon},$$
which, in the original algorithm's notation, can be reformulated as
$$q_{t+1} = q_t - \gamma\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon/\sqrt{1-b_2^t}},$$
and this points out a time-variation of the eps parameter, which is present neither in the original algorithm nor in the PyTorch implementation.
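To convince myself that the two lines above are just algebraic rewritings of each other, I put together a small NumPy sketch of a single update step (the names `step_tf`/`step_alg1` and all the toy numbers are mine, purely for illustration, not taken from either library):

```python
# Check that the TF-style update and the rewritten Algorithm-1 update coincide.
import numpy as np

gamma, b1, b2, eps, t = 1e-3, 0.9, 0.999, 1e-8, 5   # arbitrary toy values
rng = np.random.default_rng(0)
m = rng.normal(size=3)            # first-moment estimate at step t (toy values)
v = rng.uniform(0.1, 1.0, 3)      # second-moment estimate at step t (toy values)

# TF-style step: eps is added to the *uncorrected* sqrt(v)
step_tf = gamma * np.sqrt(1 - b2**t) / (1 - b1**t) * m / (np.sqrt(v) + eps)

# Same step in Algorithm-1 notation, with a rescaled epsilon
mhat, vhat = m / (1 - b1**t), v / (1 - b2**t)
step_alg1 = gamma * mhat / (np.sqrt(vhat) + eps / np.sqrt(1 - b2**t))

print(np.allclose(step_tf, step_alg1))   # True: the two forms are identical
```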
Am I wrong, or is this well known? Thanks for your help.
Indeed, you can check this in the docs for the TF Adam optimizer. To quote the relevant part:
> The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the "epsilon" referred to here is "epsilon hat" in the paper.
If you check "the formulation just before Section 2.1" in the paper, they actually include the time dependence in $\alpha$, resulting in a time-dependent "step size" $\alpha_t$ but a fixed $\epsilon$. Note that, at the end of the day, this is just rewriting/interpreting the parameters in a slightly different fashion and doesn't change the actual workings of the algorithm. But you do need to be aware that choosing the same $\epsilon$ in the PyTorch and TF implementations will apparently not lead to the same results...
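To make that last point concrete, here is a small NumPy sketch of the two update rules exactly as written in the question (so a sketch of the formulas, not of the actual library internals; the toy loss $f(q)=q^2$, the helper name `run`, and the unusually large `eps` are mine, chosen only to make the gap visible):

```python
# Compare the Algorithm-1 ("PyTorch-style") and epsilon-hat ("TF-style")
# update rules over a few steps, using the *same* eps in both.
import numpy as np

gamma, b1, b2, eps = 0.1, 0.9, 0.999, 1e-3   # toy hyperparameters

def run(style, steps=20):
    q, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = 2 * q                          # gradient of the toy loss f(q) = q**2
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2
        if style == "pytorch":             # Algorithm 1: bias-correct, then add eps
            mhat, vhat = m / (1 - b1**t), v / (1 - b2**t)
            q -= gamma * mhat / (np.sqrt(vhat) + eps)
        else:                              # "epsilon hat" formulation from the TF docs
            q -= gamma * np.sqrt(1 - b2**t) / (1 - b1**t) * m / (np.sqrt(v) + eps)
    return q

print(run("pytorch"), run("tf"))           # close, but not identical
```

The gap is largest in the first steps, where $\sqrt{1-b_2^t}$ is far from 1 (so the effective $\epsilon/\sqrt{1-b_2^t}$ is much larger than $\epsilon$), and it shrinks as $t$ grows.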