I would like to understand the difference between SGD and GD using the simplest possible function: y = x**2.
Here is the GD function:
import numpy as np

def gradient_descent(
    gradient, start, learn_rate, n_iter=50, tolerance=1e-06
):
    vector = start
    for _ in range(n_iter):
        diff = -learn_rate * gradient(vector)
        if np.all(np.abs(diff) <= tolerance):
            break
        vector += diff
    return vector
To find the minimum of x**2 we call it like this (the result is almost 0, which is correct):

gradient_descent(gradient=lambda v: 2 * v, start=10.0, learn_rate=0.2)
As I understand it, in classical GD the gradient is computed exactly from all the data points. What is "all data points" in the implementation I showed above?
And further, how should we modify this function so that it becomes SGD? (SGD uses one single data point to calculate the gradient. Where is that "one single point" in the gradient_descent function?)
The function minimized in your example does not depend on any data, so it is not helpful for illustrating the difference between GD and SGD: there are no data points over which the gradient could be computed.
Consider this example:
import numpy as np

rng = np.random.default_rng(7263)
y = rng.normal(loc=10, scale=4, size=100)

def loss(y, mean):
    # Squared-error loss of a candidate mean over the sample y
    return 0.5 * ((y - mean)**2).sum()

def gradient(y, mean):
    # Derivative of the loss with respect to mean
    return (mean - y).sum()
def mean_gd(y, learning_rate=0.005, n_iter=15, start=0):
    """Estimate the mean of y using gradient descent."""
    mean = start
    for i in range(n_iter):
        # One update per iteration, using the gradient over ALL points in y
        mean -= learning_rate * gradient(y, mean)
        print(f'Iter {i} mean {mean:0.2f} loss {loss(y, mean):0.2f}')
    return mean
def mean_sgd(y, learning_rate=0.005, n_iter=15, start=0):
    """Estimate the mean of y using stochastic gradient descent."""
    mean = start
    for i in range(n_iter):
        rng.shuffle(y)  # visit the points in a fresh random order each epoch
        for single_point in y:
            # One update per data point, using the gradient of that point only
            mean -= learning_rate * gradient(single_point, mean)
        print(f'Iter {i} mean {mean:0.2f} loss {loss(y, mean):0.2f}')
    return mean
mean_gd(y)
mean_sgd(y)
y.mean()  # the exact sample mean, for comparison
Two (very simplistic) versions of GD and SGD are used to estimate the mean of a random sample y. Estimating the mean is achieved by minimizing the squared loss.
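Written out, the loss and gradient implemented above are

L(mean) = 0.5 * Σᵢ (yᵢ − mean)²
dL/dmean = Σᵢ (mean − yᵢ)

Setting the gradient to zero gives mean = (1/n) Σᵢ yᵢ, i.e. the sample mean, which is why y.mean() is the exact value both methods approximate.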
As you understood correctly: in GD each update uses the gradient computed on the whole dataset, while in SGD each update uses the gradient of a single random point at a time.
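To answer your second question directly, here is a minimal sketch of how your gradient_descent function could be turned into SGD (the extra data and seed parameters and the two-argument gradient callback are my additions for illustration, not part of your original code):

def stochastic_gradient_descent(
    gradient, data, start, learn_rate, n_iter=50, tolerance=1e-06, seed=0
):
    rng = np.random.default_rng(seed)
    vector = start
    for _ in range(n_iter):
        # Here is the "one single point": each step draws one random observation
        single_point = rng.choice(data)
        diff = -learn_rate * gradient(single_point, vector)
        if np.all(np.abs(diff) <= tolerance):
            break
        vector += diff
    return vector

With the mean-estimation example above you would call it as stochastic_gradient_descent(gradient, y, start=0.0, learn_rate=0.005). Note that the tolerance-based early stop is much less meaningful here than in full GD, because a single point's gradient can be small purely by chance.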