I am trying to perform gradient descent to compute three MLEs (from scratch). I have data $x_i=s_i+w_i$ where $s_i=A(nu_i/nu_0)^{alpha}(nu_i/nu_0+1)^{-4alpha}$ where I have calculated the first derivatives analytically:
import numpy as np
import matplotlib.pyplot as plt
#model function s_i
def signal(A,nu_0,alpha,nu):
return A*(nu/nu_0)**alpha*(1+nu/nu_0)**(-4*alpha)
#partial derivatives of log likelihood function
def MLE_A(A,nu_0,alpha,nu,x_i):
nu=np.array(nu)
x_i=np.array(x_i)
return -np.sum(((nu/nu_0)**alpha*((A*(nu/nu_0)**alpha)/(nu/nu_0+1)**(4*alpha)-x_i))/(nu/nu_0+1)**(4*alpha))
def MLE_alpha(A,nu_0,alpha,nu,x_i):
nu=np.array(nu)
x_i=np.array(x_i)
return -np.sum((A*(nu/nu_0)**alpha*(4*np.log(nu/nu_0+1)-np.log(nu/nu_0))*(x_i*(nu/nu_0+1)**(4*alpha)-A*(nu/nu_0)**alpha))/(nu/nu_0+1)**(8*alpha))
def MLE_nu_0(A,nu_0,alpha,nu,x_i):
nu=np.array(nu)
x_i=np.array(x_i)
return -np.sum((A*alpha*(nu/nu_0)**(alpha)*(nu_0-3*nu)*((x_i*((nu)/nu_0+1)**(4*alpha))-A*(nu/nu_0)**alpha))/(nu_0*(nu+nu_0)*((nu)/nu_0+1)**(8*alpha)))
def gradient_descent(A_init,nu_0_init,alpha_init,nu,x_i,iterations=1000,learning_rate=0.01):
A=A_init
nu_0=nu_0_init
alpha=alpha_init
theta=np.array([A_init,nu_0_init,alpha_init])
updated_theta=[theta]
for i in range(iterations):
new_theta = theta - learning_rate * np.array([MLE_A(A,nu_0,alpha,nu,x_i), MLE_nu_0(A,nu_0,alpha,nu,x_i), MLE_alpha(A,nu_0,alpha,nu,x_i)])
theta=new_theta
updated_theta.append(theta)
A,nu_0,alpha = new_theta[0],new_theta[1],new_theta[2]
return(updated_theta)
A=6
nu_0=2
alpha=1
nu=np.linspace(0.05,1.0,200)
x_i=signal(A,nu_0,alpha,nu)+np.random.normal(0,0.05,len(nu))
params= gradient_descent(A,nu_0,alpha,nu,x_i,iterations=10000,learning_rate=0.01)
print(params[-1])
A_fit=params[-1][0]
nu_0_fit=params[-1][1]
alpha_fit=params[-1][2]
plt.plot(nu,x_i)
plt.plot(nu,signal(A_fit,nu_0_fit,alpha_fit,nu))
plt.show()
Sometimes I get errors like RuntimeWarning: overflow encountered in power and RuntimeWarning: invalid value encountered in true_divide and sometimes I get wildly off values. I used different values for the learning rate and it did not fix it. I have checked the functions by hand and using symbolic software. Also, I used latexify to see that I did type them in right so I am assuming it is my implementation of the gradient descent that is somehow off.
Your MCVE seems to have hard time with some xi
values when assessing derivatives: it experiences overflow or division by 0. As a consequence, your parameter vector contains NaN
and the algorithm fails.
For some setup, the problem is not present, for instance fixing the random seed to:
np.random.seed(1234567)
Generate a dataset that prevents your code to fail and then we get the following result:
# [5.66701368 2.18827948 5.48118174]
Which may indicate that the last derivative is not exact.
Performing gradient descent for MSE with numerical assessment of the gradient works fine (for other seeds as well) and then is model independent (user does not have to provide derivatives):
import numpy as np
import numdifftools as nd
from scipy import optimize
import matplotlib.pyplot as plt
def model(x, A, nu0, alpha):
return A * np.power((x/nu0), 1. * alpha) * np.power((1 + x/nu0), (-4. * alpha))
def gradient_factory(model, x, y):
def wrapped(p):
return 0.5 * np.sum(np.power(y - model(x, *p), 2.))
return nd.Gradient(wrapped)
np.random.seed(1234567)
p0 = (6, 2, 1)
nu = np.linspace(0.05, 1.0, 200)
xi = model(nu, *p0) + 0.05 * np.random.normal(size=nu.size)
def gradient_descent(model, x, y, p0, tol=1e-16, maxiter=500, rate=0.001, atol=1e-10, rtol=1e-8):
gradient = gradient_factory(model, x, y)
p = np.array(p0)
dp = gradient(p)
for _ in range(maxiter):
# Update gradient descent:
p_ = p - rate * dp
dp_ = gradient(p_)
# Update rate:
Dp_ = p_ - p
Ddp_ = dp_ - dp
rate = np.abs(Dp_.T @ Ddp_) / np.power(np.linalg.norm(Ddp_), 2)
# Break when precision is reached:
if np.allclose(p, p_, atol=atol, rtol=rtol):
break
# Next iteration:
p = p_
dp = dp_
else:
raise RuntimeError("Max Iteration (maxiter=%d) reached" % maxiter)
return p
p = gradient_descent(model, nu, xi, [1., 1., 1.])
# array([5.82733367, 2.06411618, 0.98227514])
And converges toward the same solution than curve_fit
:
popt, pcov = optimize.curve_fit(model, nu, xi)
# (array([5.82905928, 2.06399904, 0.98240413]),
# array([[ 0.56149129, -0.03771281, 0.04195576],
# [-0.03771281, 0.00493833, -0.00287965],
# [ 0.04195576, -0.00287965, 0.00314415]]))
You may adapt this MCVE to perform MLE instead of minimizing MSE. You can also provide your own definition of gradient to drive the descent instead of assessing it numerically.