python · function · machine-learning · runtime-error · gradient-descent

Multivariable Gradient Descent for MLEs (nonlinear model) in Python


I am trying to perform gradient descent to compute three MLEs (from scratch). I have data $x_i = s_i + w_i$, where $s_i = A(\nu_i/\nu_0)^{\alpha}(\nu_i/\nu_0 + 1)^{-4\alpha}$ and $w_i$ is Gaussian noise. I have calculated the first derivatives of the log-likelihood analytically:

import numpy as np
import matplotlib.pyplot as plt

#model function s_i
def signal(A,nu_0,alpha,nu):
    return A*(nu/nu_0)**alpha*(1+nu/nu_0)**(-4*alpha)

#partial derivatives of log likelihood function
def MLE_A(A,nu_0,alpha,nu,x_i):
    nu=np.array(nu)
    x_i=np.array(x_i)
    return -np.sum(((nu/nu_0)**alpha*((A*(nu/nu_0)**alpha)/(nu/nu_0+1)**(4*alpha)-x_i))/(nu/nu_0+1)**(4*alpha))

def MLE_alpha(A,nu_0,alpha,nu,x_i):
    nu=np.array(nu)
    x_i=np.array(x_i)
    
    return -np.sum((A*(nu/nu_0)**alpha*(4*np.log(nu/nu_0+1)-np.log(nu/nu_0))*(x_i*(nu/nu_0+1)**(4*alpha)-A*(nu/nu_0)**alpha))/(nu/nu_0+1)**(8*alpha))

def MLE_nu_0(A,nu_0,alpha,nu,x_i):
    nu=np.array(nu)
    x_i=np.array(x_i)
    
    return -np.sum((A*alpha*(nu/nu_0)**(alpha)*(nu_0-3*nu)*((x_i*((nu)/nu_0+1)**(4*alpha))-A*(nu/nu_0)**alpha))/(nu_0*(nu+nu_0)*((nu)/nu_0+1)**(8*alpha)))

 
def gradient_descent(A_init,nu_0_init,alpha_init,nu,x_i,iterations=1000,learning_rate=0.01):
    
    A=A_init
    nu_0=nu_0_init
    alpha=alpha_init
    theta=np.array([A_init,nu_0_init,alpha_init])
    updated_theta=[theta]
    for i in range(iterations):
        new_theta = theta - learning_rate * np.array([MLE_A(A,nu_0,alpha,nu,x_i), MLE_nu_0(A,nu_0,alpha,nu,x_i), MLE_alpha(A,nu_0,alpha,nu,x_i)])
        theta=new_theta
        updated_theta.append(theta)
        A, nu_0, alpha = new_theta
    return updated_theta

A=6
nu_0=2
alpha=1
nu=np.linspace(0.05,1.0,200)
x_i=signal(A,nu_0,alpha,nu)+np.random.normal(0,0.05,len(nu))


params = gradient_descent(A,nu_0,alpha,nu,x_i,iterations=10000,learning_rate=0.01)

print(params[-1])
A_fit=params[-1][0]
nu_0_fit=params[-1][1]
alpha_fit=params[-1][2]

plt.plot(nu,x_i)
plt.plot(nu,signal(A_fit,nu_0_fit,alpha_fit,nu))

plt.show()


Sometimes I get errors like RuntimeWarning: overflow encountered in power and RuntimeWarning: invalid value encountered in true_divide, and sometimes I get wildly off values. I tried different values for the learning rate and it did not fix the problem. I have checked the derivatives by hand and with symbolic software, and I used latexify to confirm that I typed them in correctly, so I am assuming it is my implementation of the gradient descent that is somehow off.


Solution

  • Observations

    Your MCVE seems to have a hard time with some $x_i$ values when assessing the derivatives: it experiences overflow or division by zero. As a consequence, the parameter vector fills with NaN and the algorithm fails.
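
    A quick way to surface this (a sketch around a single descent step, using NumPy's error-state machinery to promote those RuntimeWarnings to exceptions) is:

    # Fail loudly on the first overflow / zero-division instead of letting
    # NaN propagate silently through theta:
    with np.errstate(over='raise', invalid='raise'):
        step = np.array([MLE_A(A, nu_0, alpha, nu, x_i),
                         MLE_nu_0(A, nu_0, alpha, nu, x_i),
                         MLE_alpha(A, nu_0, alpha, nu, x_i)])
    assert not np.any(np.isnan(step))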

    For some setups the problem is not present; for instance, fixing the random seed to:

    np.random.seed(1234567)
    

    This generates a dataset on which your code does not fail, and we then get the following result:

    # [5.66701368 2.18827948 5.48118174]
    

    Compared with the true parameters $(A, \nu_0, \alpha) = (6, 2, 1)$, the last component is far off, which may indicate that the last derivative is not exact.
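
    One way to pin down which formula is off (a sketch reusing the question's signal and MLE_* functions, plus the numdifftools package used below) is to compare each analytical derivative with a numerical gradient of the log-likelihood at the true parameters:

    import numdifftools as nd

    def log_like(p):
        # Gaussian log-likelihood up to additive constants and the noise
        # variance (which only rescales the gradient uniformly):
        return -0.5 * np.sum((x_i - signal(p[0], p[1], p[2], nu))**2)

    print(nd.Gradient(log_like)([6., 2., 1.]))
    print([MLE_A(6., 2., 1., nu, x_i),
           MLE_nu_0(6., 2., 1., nu, x_i),
           MLE_alpha(6., 2., 1., nu, x_i)])
    # Given the question's sign convention, the components should agree;
    # the one that does not flags the bad formula.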

  • MCVE

    Performing gradient descent on the MSE with numerical assessment of the gradient works fine (for other seeds as well) and is model independent (the user does not have to provide derivatives):

    import numpy as np
    import numdifftools as nd
    from scipy import optimize
    import matplotlib.pyplot as plt
    
    def model(x, A, nu0, alpha):
        return A * np.power((x/nu0), 1. * alpha) * np.power((1 + x/nu0), (-4. * alpha))
    
    def gradient_factory(model, x, y):
        def wrapped(p):
            return 0.5 * np.sum(np.power(y - model(x, *p), 2.))
        return nd.Gradient(wrapped)
    
    np.random.seed(1234567)
    p0 = (6, 2, 1)
    nu = np.linspace(0.05, 1.0, 200)
    xi = model(nu, *p0) + 0.05 * np.random.normal(size=nu.size)
    
    def gradient_descent(model, x, y, p0, maxiter=500, rate=0.001, atol=1e-10, rtol=1e-8):
    
        gradient = gradient_factory(model, x, y)
        
        p = np.array(p0)
        dp = gradient(p)
    
        for _ in range(maxiter):
    
            # Update gradient descent:
            p_ = p - rate * dp
            dp_ = gradient(p_)
    
            # Update rate (Barzilai-Borwein step size):
            Dp_ = p_ - p
            Ddp_ = dp_ - dp
            rate = np.abs(Dp_.T @ Ddp_) / np.power(np.linalg.norm(Ddp_), 2)
    
            # Break when precision is reached:
            if np.allclose(p, p_, atol=atol, rtol=rtol):
                break
    
            # Next iteration:
            p = p_
            dp = dp_
    
        else:
            raise RuntimeError("Max Iteration (maxiter=%d) reached" % maxiter)
            
        return p
    
    p = gradient_descent(model, nu, xi, [1., 1., 1.])
    # array([5.82733367, 2.06411618, 0.98227514])
    


    And it converges toward the same solution as curve_fit:

    popt, pcov = optimize.curve_fit(model, nu, xi)
    # (array([5.82905928, 2.06399904, 0.98240413]),
    #  array([[ 0.56149129, -0.03771281,  0.04195576],
    #         [-0.03771281,  0.00493833, -0.00287965],
    #         [ 0.04195576, -0.00287965,  0.00314415]]))
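
    As a side note (standard SciPy usage), the diagonal of pcov holds the parameter variances, so the 1-sigma uncertainties are:

    perr = np.sqrt(np.diag(pcov))  # 1-sigma errors on (A, nu0, alpha)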
    

    You may adapt this MCVE to perform MLE instead of minimizing the MSE. You can also provide your own definition of the gradient to drive the descent instead of assessing it numerically, as sketched below.
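
    For the latter, a minimal sketch (assuming the question's MLE_* functions are in scope, and keeping in mind that at least one of them may be inexact, as observed above):

    def analytical_gradient_factory(model, x, y):
        # Hypothetical replacement for gradient_factory, with the same
        # signature so it can be swapped in inside gradient_descent.
        # `model` is unused here: the derivatives below are specific
        # to this signal.
        def gradient(p):
            A, nu0, alpha = p
            # The MLE_* functions are derivatives of the log-likelihood
            # (an ascent direction), while the descent minimizes the MSE
            # objective, hence the sign flip:
            return -np.array([MLE_A(A, nu0, alpha, x, y),
                              MLE_nu_0(A, nu0, alpha, x, y),
                              MLE_alpha(A, nu0, alpha, x, y)])
        return gradient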