I'm having a bit of trouble with fitting a curve to some data, but can't work out where I am going wrong.

In the past I have done this with numpy.linalg.lstsq for exponential functions and scipy.optimize.curve_fit for sigmoid functions. This time I wished to create a script that would let me specify various functions, determine their parameters and test their fit against the data. While doing this I noticed that Scipy leastsq and Numpy lstsq seem to provide different answers for the same set of data and the same function. The function is simply y = e^(l*x) and is constrained such that y = 1 at x = 0.

The Excel trend line agrees with the Numpy lstsq result, but as Scipy leastsq is able to take any function, it would be good to work out what the problem is.
import scipy.optimize as optimize
import numpy as np
import matplotlib.pyplot as plt
## Sampled data
x = np.array([0, 14, 37, 975, 2013, 2095, 2147])
y = np.array([1.0, 0.764317544, 0.647136491, 0.070803763, 0.003630962, 0.001485394, 0.000495131])
# function
fp = lambda p, x: np.exp(p*x)
# error function
e = lambda p, x, y: (fp(p, x) - y)
# using scipy least squares
l1, s = optimize.leastsq(e, -0.004, args=(x,y))
print(l1)
# [-0.0132281]
# using numpy least squares
l2 = np.linalg.lstsq(np.vstack([x, np.zeros(len(x))]).T,np.log(y))[0][0]
print(l2)
# -0.00313461628963 (same answer as Excel trend line)
# smooth x for plotting
x_ = np.arange(0, x[-1], 0.2)
plt.figure()
# plt.plot(x, y, 'rx', x_, fp(l1, x_), 'b-', x_, fp(l2, x_), 'g-')
plt.plot(x, y, 'rx', label='data')
plt.plot(x_, fp(l1, x_), 'b-', label='scipy.optimize.leastsq fit')
plt.plot(x_, fp(l2, x_), 'g-', label='np.linalg.lstsq fit')
plt.legend()
plt.show()
The MWE above includes a small sample of the dataset. When fitting the actual data, the scipy.optimize.curve_fit curve gives an R^2 of 0.82, while the numpy.linalg.lstsq curve, which is the same as that calculated by Excel, gives an R^2 of 0.41.
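For reference, here is a minimal sketch of the kind of R^2 calculation I mean, assuming the usual 1 - SS_res/SS_tot definition evaluated on the original y scale (r_squared is just an illustrative helper name, not part of the MWE):

import numpy as np

def r_squared(y, y_pred):
    # coefficient of determination: 1 - SS_res / SS_tot on the original y scale
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

# e.g. with the MWE variables above:
# r_squared(y, fp(l1, x))   # scipy.optimize.leastsq fit
# r_squared(y, fp(l2, x))   # np.linalg.lstsq fit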
You are minimizing different error functions.

When you use numpy.linalg.lstsq, the error function being minimized is

np.sum((np.log(y) - p * x)**2)

while scipy.optimize.leastsq minimizes the function

np.sum((y - np.exp(p * x))**2)

The first case requires a linear dependency between the dependent and independent variables, but the solution is known analytically, while the second can handle any dependency and relies on an iterative method.
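A quick sketch to illustrate this, re-deriving both estimates on the sample data from the question and evaluating each one under both objectives (the function names here are just illustrative); each estimate should come out best on the criterion it was fitted with:

import numpy as np
import scipy.optimize as optimize

x = np.array([0, 14, 37, 975, 2013, 2095, 2147])
y = np.array([1.0, 0.764317544, 0.647136491, 0.070803763,
              0.003630962, 0.001485394, 0.000495131])

def sse_log_linear(p):
    # objective that np.linalg.lstsq minimizes, on the log-transformed data
    return np.sum((np.log(y) - p * x) ** 2)

def sse_exponential(p):
    # objective that scipy.optimize.leastsq minimizes, on the original data
    return np.sum((y - np.exp(p * x)) ** 2)

# re-derive both estimates
l1 = optimize.leastsq(lambda p: y - np.exp(p * x), -0.004)[0][0]
l2 = np.linalg.lstsq(x[:, None], np.log(y), rcond=None)[0][0]

# each estimate wins on the objective it was fitted with
print(sse_exponential(l1) < sse_exponential(l2))   # expected: True
print(sse_log_linear(l2) < sse_log_linear(l1))     # expected: True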
On a separate note, I cannot test it right now, but when using numpy.linalg.lstsq you don't need to vstack a row of zeros; the following works as well:
l2 = np.linalg.lstsq(x[:, None], np.log(y))[0][0]
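A small check of that claim on the sample data from the question (just a sketch; the padded zero column is rank deficient, so lstsq's minimum-norm solution should assign it a zero coefficient and leave the slope unchanged):

import numpy as np

x = np.array([0, 14, 37, 975, 2013, 2095, 2147])
y = np.array([1.0, 0.764317544, 0.647136491, 0.070803763,
              0.003630962, 0.001485394, 0.000495131])

# original design matrix: the slope column padded with a column of zeros
A_padded = np.vstack([x, np.zeros(len(x))]).T
# single-column design matrix: just the slope, no intercept term
A_single = x[:, None]

l2_padded = np.linalg.lstsq(A_padded, np.log(y), rcond=None)[0][0]
l2_single = np.linalg.lstsq(A_single, np.log(y), rcond=None)[0][0]

print(np.isclose(l2_padded, l2_single))  # expected: True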