python scipy curve-fitting scipy-optimize

Why doesn't scipy.optimize.curve_fit fit the data?

I'm having some trouble using curve_fit to fit some curve. From visual inspection (and knowledge of the domain), the data is sinusoidal in nature. However, curve_fit does not match the data at all.

Here is my code, as I have it now:

from scipy.optimize import curve_fit as cf
import numpy as np
from matplotlib import pyplot as plt

xdata=np.arange(0.05,10.01,0.05)

ydata="""
0.4000752
0.4231248
0.4456256
0.4684008
0.4892552
0.5054448
0.5150488
0.5155976
0.5081888
0.4919992
0.4716936
0.4494672
0.4258688
0.4058376
0.3901968
0.3803184
0.3794952
0.3866296
0.400624
0.4201064
0.4428816
0.4656568
0.4873344
0.5032496
0.5120304
0.513128
0.5070912
0.492548
0.4730656
0.4522112
0.4297104
0.4085816
0.3926664
0.3866296
0.3825136
0.3885504
0.4030936
0.4228504
0.4448024
0.4656568
0.48706
0.5029752
0.5120304
0.5134024
0.5068168
0.4928224
0.47334
0.4502904
0.4269664
0.4063864
0.3904712
0.3811416
0.3797696
0.3866296
0.4000752
0.4203808
0.4434304
0.4662056
0.4895296
0.5057192
0.518616
0.5191648
0.5101096
0.4988592
0.5005056
0.460992
0.4310824
0.4096792
0.3934896
0.3808672
0.380044
0.3866296
0.4000752
0.4201064
0.4428816
0.4659312
0.4865112
0.5024264
0.5114816
0.513128
0.5068168
0.4933712
0.4738888
0.4519368
0.4288872
0.4072096
0.3912944
0.385532
0.3797696
0.3888248
0.4058376
0.42532
0.444528
0.4673032
0.4919992
0.50764
0.513128
0.5128536
0.5059936
0.4950176
0.4771816
0.4511136
0.4286128
0.407484
0.3912944
0.3819648
0.380044
0.3860808
0.3998008
0.4190088
0.4415096
0.4648336
0.4851392
0.5013288
0.5112072
0.5128536
0.5125792
0.4922736
0.47334
0.4533088
0.4305336
0.4099536
0.393764
0.381416
0.3792208
0.385532
0.3989776
0.4190088
0.4409608
0.4640104
0.4854136
0.502152
0.5112072
0.513128
0.5068168
0.492548
0.4736144
0.4516624
0.4291616
0.4077584
0.3921176
0.382788
0.3803184
0.3863552
0.4000752
0.4179112
0.4406864
0.4634616
0.4865112
0.5029752
0.5125792
0.5142256
0.5081888
0.4947432
0.4736144
0.4519368
0.4286128
0.407484
0.3918432
0.3822392
0.3803184
0.3860808
0.3998008
0.4181856
0.440412
0.4634616
0.4837672
0.5002312
0.510384
0.5128536
0.5065424
0.4933712
0.4752608
0.4522112
0.4297104
0.4080328
0.3912944
0.381416
0.3794952
0.3880016
0.3987032
0.4179112
0.4420584
0.4626384
0.4832184
0.499408
0.5125792
0.5128536
0.506268
0.4933712
0.4749864
0.45276
0.4297104
0.4091304
0.3926664
0.382788
0.3808672
0.3866296
0.3998008
0.41846
"""

ydata=[float(_) for _ in ydata.strip().splitlines()]
print(ydata)
def sin_fun(x,a,b,c,d):
    return a*np.cos(b*x+c)+d

p_opt,p_cov=cf(sin_fun,xdata,ydata, p0=(0.05, 0.5, 0.01,0.45),method='trf')

print(p_opt)
plt.plot(xdata,sin_fun(xdata,*p_opt))
plt.plot(xdata,ydata, 'r.-', ms=1)
plt.show()

I tried scaling the graph vertically and horizontally, but it seem to make things worse (the amplitude was now smaller than 0.05, not larger than it).

Solution

The thing to understand about gradient-based optimizers like curve_fit is that they're very bad at fitting sine waves. Here is a plot of your objective function, while only varying b. The X axis represents the b parameter. The Y axis is the root mean square error of the fit, with lower values meaning a better fit.

The problem with curve_fit is that it's a local optimizer. Any local optimizer can get stuck in local optima, and a program that tries to fit a frequency parameter for sine wave will get stuck in lots of local optima. Any valley in the above function is a place where it can get stuck.

There are a few things you can try here.

Use a better guess.
Use a global optimizer.

Here's how to solve this by using a better initial guess.

I would suggest using an FFT to figure out what frequency to use.

Example:

fft = np.fft.fft(ydata - np.array(ydata).mean())
timestep = xdata[1] - xdata[0]
freq = np.fft.fftfreq(len(fft), d=timestep)
largest_component = np.abs(fft).argmax()
phase_guess = np.angle(fft[largest_component]) * freq[largest_component]
initial_frequency_guess = freq[largest_component] * 2 * np.pi


p0 = np.array([
    np.std(ydata) * np.sqrt(2),
    initial_frequency_guess,
    phase_guess,
    np.mean(ydata)
])
p_opt,p_cov=cf(sin_fun,xdata,ydata, p0=p0,method='trf')

The most important one is initial_frequency_guess. The optimizer can figure out the other ones.

Here is the fit this guess produces:

Then polish it with curve_fit():