I'm trying to (1) simulate data with a Weibull distribution using known parameters, and then (2) trying to fit that simulated data with a Weibull function and get parameters back. The idea is that I will ultimately try to fit a Weibull to real data, but I've run into problems during very basic testing.
Here's my code:
from scipy.stats import weibull_min
from scipy.optimize import curve_fit
def weibull(x, shape, loc, scale):
return weibull_min.pdf(x, shape, loc, scale)
# Create a 3-second timecourse
t_start = 0
t_end = 3
samples = t_end * 100 # simulate 100 samples per second
t = np.linspace(t_start, t_end, samples)
# Set known (kn) parameters for the Weibull distribution
kn_shape, kn_loc, kn_scale = 1.5, 0, 1
# Generate the Weibull distribution with the known parameters
sim_data = weibull(t, kn_shape, kn_loc, kn_scale)
# Fit the simulated data with a Weibull function and return parameters
params, _ = curve_fit(weibull, t, sim_data_noisy)
The problem is that params
should be exactly the same as known_params
, but they are not. The following:
print(f'shape = {params[0]}, loc = {params[1]}, scale = {params[2]}')
...returns
shape = 1.4999999989589334, loc = 4.399587587206362e-10, scale = 0.9999999997072269
Of course, all three parameters are pretty damn close to the initial inputs, but I don't see why they are not identical. Can someone help me understand what is going on here?
I'm running Python 3.10.4 on a Linux server
The problem is that curve_fit()
is approximate. You're very unlikely to to get exactly the parameters you put in back out.
There are a few reasons for this.
First, it's using floating point math, which is only accurate to about 15 decimal places. Getting an answer accurate to 10 decimal places is still pretty good.
The second problem is that curve_fit()
will terminate before finding the best solution. Specifically, if the solution it finds changes by less than xtol in one iteration, or is within ftol of the last iteration, or the gradient is less than gtol, it will terminate.
Experimentally, I found a few things that helped. One was lowering the tolerances.
tol = 1e-10
params, _ = curve_fit(weibull, t, sim_data, ftol=tol, xtol=tol, gtol=tol)
Another thing that I found that helped was using the trust region reflection method.
tol = 1e-10
params, _ = curve_fit(weibull, t, sim_data, xtol=tol, gtol=tol, ftol=tol, method='trf')
This was sufficient to get all parameters within 1e-16.