python machine-learning gradient-descent sgd

Stochastic Gradient Descent vs. Gradient Descent for the x**2 function


I would like to understand the difference between SGD and GD using the simplest possible function: y = x**2

My GD implementation is here:

import numpy as np


def gradient_descent(
    gradient, start, learn_rate, n_iter=50, tolerance=1e-06
):
    vector = start
    for _ in range(n_iter):
        # step in the direction of the negative gradient
        diff = -learn_rate * gradient(vector)
        # stop once the update becomes negligibly small
        if np.all(np.abs(diff) <= tolerance):
            break
        vector += diff
    return vector

And to find the minimum of the x**2 function we call it like this (the answer is almost 0, which is correct):

gradient_descent(gradient=lambda v: 2 * v, start=10.0, learn_rate=0.2)
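
For intuition, each update on this quadratic multiplies the current point by 1 - 2 * learn_rate = 0.6, so the iterates shrink geometrically toward 0. A small sketch of my own (not part of the snippet above) that just traces the first few steps:

# trace the first few GD steps on y = x**2 with learn_rate = 0.2
x = 10.0
for step in range(5):
    x -= 0.2 * (2 * x)  # x_new = x - 0.2 * 2 * x = 0.6 * x
    print(step, x)      # x goes 6.0, 3.6, 2.16, 1.296, 0.7776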

As I understand it, in classical GD the gradient is calculated exactly from all the data points. What are "all the data points" in the implementation I showed above?

And further, how should we modify this function so that it becomes SGD? (SGD uses a single data point to calculate the gradient. Where is that "single point" in the gradient_descent function?)


Solution

  • The function minimized in your example does not depend on any data, so it is not helpful to illustrate the difference between GD and SGD.

    Consider this example:

    import numpy as np

    rng = np.random.default_rng(7263)

    # random sample whose mean we want to estimate
    y = rng.normal(loc=10, scale=4, size=100)


    def loss(y, mean):
        return 0.5 * ((y - mean) ** 2).sum()


    def gradient(y, mean):
        return (mean - y).sum()


    def mean_gd(y, learning_rate=0.005, n_iter=15, start=0):
        """Estimate the mean of y using gradient descent."""
        mean = start
        for i in range(n_iter):
            # one update per iteration, using the gradient over the whole sample
            mean -= learning_rate * gradient(y, mean)
            print(f'Iter {i} mean {mean:0.2f} loss {loss(y, mean):0.2f}')
        return mean


    def mean_sgd(y, learning_rate=0.005, n_iter=15, start=0):
        """Estimate the mean of y using stochastic gradient descent."""
        mean = start
        for i in range(n_iter):
            rng.shuffle(y)
            # one update per data point, using only that point's gradient
            for single_point in y:
                mean -= learning_rate * gradient(single_point, mean)
            print(f'Iter {i} mean {mean:0.2f} loss {loss(y, mean):0.2f}')
        return mean


    mean_gd(y)
    mean_sgd(y)
    y.mean()


    Two (very simplistic) versions of GD and SGD are used to estimate the mean of a random sample y. Estimating the mean is achieved by minimizing the squared loss. As you understood correctly, in GD each update uses the gradient computed on the whole dataset, whereas in SGD each update uses the gradient of a single randomly chosen point.
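
    To connect this back to the gradient_descent function from your question: that function only passes the current parameter value to the gradient callback, so the data has to be captured inside the callback itself. Below is a minimal sketch of how it could be driven in GD or SGD fashion; the names full_gradient and single_point_gradient are my own and not part of your code:

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.normal(loc=10, scale=4, size=100)

    # GD: every call sums the gradient over *all* data points
    full_gradient = lambda mean: (mean - y).sum()

    # SGD-style: every call uses *one* randomly drawn data point
    single_point_gradient = lambda mean: mean - rng.choice(y)

    mean_gd_estimate = gradient_descent(full_gradient, start=0.0, learn_rate=0.005)
    mean_sgd_estimate = gradient_descent(
        single_point_gradient, start=0.0, learn_rate=0.005, n_iter=2000
    )

    Both estimates should approach y.mean(); the SGD-style version needs many more (but cheaper) iterations and a small learning rate, because each step is based on a noisy, single-point gradient.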