I recently watched Andrew Ng's video on SGD with momentum (SGDM). I understand that the momentum term updates a running average V_dw by weighting the previous V_dw with Beta and adding a small (1 - Beta) component of the current gradient dW. What I don't understand is why momentum is also described as an exponentially weighted average. Also, at 6:37 in Ng's video he says that using Beta = 0.9 effectively means averaging over the last 10 gradients. Can someone explain how that works? To me it just looks like a scalar weight of 1 - 0.9 = 0.1 applied to every entry of the gradient vector dW.
Appreciate any insight! I feel like I'm missing something fundamental.
You just have to think about what is in your last gradient: the last gradient is already an exponentially weighted sum of earlier gradients, because of the momentum term.
In the first step you just do plain gradient descent, so m_grad_1 = grad_1. In the second step you get a momentum gradient m_grad_2 = grad_2 + 0.9 * m_grad_1. In the third step you again get m_grad_3 = grad_3 + 0.9 * m_grad_2, but now the old gradient is itself a momentum term: 0.9 * m_grad_2 = 0.9 * (grad_2 + 0.9 * grad_1) = 0.9 * grad_2 + 0.81 * grad_1. Unrolling this, the gradient from k steps ago enters the current update with weight 0.9^k, which is why this is called an exponentially weighted average. After 10 steps that weight has shrunk to 0.9^10 ≈ 0.35 ≈ 1/e, which is the sense in which Beta = 0.9 corresponds to averaging roughly the last 10 gradients.
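If it helps to see the decay numerically, here is a minimal Python sketch of the same simplified recursion used above (v_k = grad_k + beta * v_{k-1}); feeding in a single unit gradient and then zeros makes the beta^k weighting explicit. (Ng's actual formulation also scales the current gradient by 1 - beta, i.e. V_dw = beta * V_dw + (1 - beta) * dW, but the decay argument is identical.)

```python
beta = 0.9

def momentum_step(v_prev, grad, beta=beta):
    # Simplified momentum recursion from the answer: v_k = grad_k + beta * v_{k-1}
    return grad + beta * v_prev

# Feed in a unit "gradient" at step 0 and zeros afterwards to watch its
# contribution decay: after k further steps it is weighted by beta**k.
v = 0.0
for k in range(11):
    grad = 1.0 if k == 0 else 0.0
    v = momentum_step(v, grad)
    print(k, round(v, 4))

# Output: step 0 -> 1.0, step 1 -> 0.9, step 2 -> 0.81, ..., step 10 -> 0.3487
# 0.9**10 is about 0.35, roughly 1/e, which is why beta = 0.9 is said to
# average over "about the last 10 gradients".
```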