I understand what Gradient Descent does. Basically, it moves towards a locally optimal solution by slowly stepping down the curve. I am trying to understand the actual difference between plain gradient descent and Newton's method.
From Wikipedia, I read this short line "Newton's method uses curvature information to take a more direct route." What does this intuitively mean?
At a local minimum (or maximum) $x$, the derivative of the target function $f$ vanishes: $f'(x) = 0$ (assuming sufficient smoothness of $f$).
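For instance, for the quadratic $f(x) = x^2$ the derivative is $f'(x) = 2x$, which vanishes exactly at the minimizer $x = 0$.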
Gradient descent tries to find such a minimum $x$ by using information from the first derivative of $f$: it simply follows the steepest descent from the current point. This is like rolling a ball down the graph of $f$ until it comes to rest (while neglecting inertia).
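To make this concrete, here is a minimal sketch of gradient descent in one dimension. In update form it reads $x_{k+1} = x_k - \gamma\, f'(x_k)$. The example function $f(x) = x^4 - 3x^3 + 2$, the starting point, and the step size $\gamma$ below are arbitrary illustrative choices, not anything specific to the question.

```python
# Minimal 1-D gradient descent sketch (illustrative values, not tuned).
def f_prime(x):
    # Derivative of the example function f(x) = x**4 - 3*x**3 + 2.
    return 4 * x**3 - 9 * x**2

x = 6.0          # starting point (arbitrary)
gamma = 0.01     # fixed step size (arbitrary)
tolerance = 1e-8

for _ in range(10_000):
    step = gamma * f_prime(x)   # move against the slope
    x -= step
    if abs(step) < tolerance:
        break

print(x)  # approaches the local minimum at x = 9/4
```

Note how the fixed step size $\gamma$ has to be chosen small enough to avoid overshooting, which is one reason plain gradient descent can need many iterations.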
Newton's method tries to find a point $x$ satisfying $f'(x) = 0$ by approximating $f'$ with a linear function $g$ and then solving for the root of that function explicitly (this is Newton's root-finding method). The root of $g$ is not necessarily the root of $f'$, but under many circumstances it is a good guess (the Wikipedia article on Newton's method for root finding has more information on convergence criteria). While approximating $f'$, Newton's method makes use of $f''$ (the curvature of $f$). This means it has higher requirements on the smoothness of $f$, but it also means that (by using more information) it often converges faster.
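For comparison, here is a sketch of Newton's method applied to the same illustrative function as above. The update $x_{k+1} = x_k - f'(x_k)/f''(x_k)$ is just Newton's root-finding step applied to $f'$; again, the starting point and the iteration count are arbitrary.

```python
# Minimal 1-D Newton's method sketch for minimization (same example function).
def f_prime(x):
    return 4 * x**3 - 9 * x**2    # f'(x) for f(x) = x**4 - 3*x**3 + 2

def f_double_prime(x):
    return 12 * x**2 - 18 * x     # f''(x), the curvature information

x = 6.0   # same arbitrary starting point as before
for _ in range(20):
    # Newton step: root-finding on f'; assumes f''(x) != 0 along the way.
    x -= f_prime(x) / f_double_prime(x)

print(x)  # reaches the local minimum at x = 9/4 in a handful of steps
```

Because each step rescales the slope by the curvature $f''(x_k)$, there is no step size to tune, and near a well-behaved minimum the iteration typically needs far fewer steps than gradient descent.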