I am writing mass spectrometry data reduction software and am trying to fit a set of raw data (y, t), where y is the intensity of a given gas species in the mass spectrometer and t is the timestamp associated with that intensity measurement, to a double natural log function.
I am using the following functions:
import numpy as np
from scipy.optimize import curve_fit

def double_ln_func(t, a, b, p, c, q, d):
    # note: d is accepted here but never used in the expression
    return a + b * np.log(p * t) + c * np.log(q * t)

def double_ln_fit(t, y):
    popt, pcov = curve_fit(double_ln_func, t, y)
    fitted_t = np.linspace(min(t), max(t), 100)
    fitted_y = double_ln_func(fitted_t, *popt)
    return fitted_t, fitted_y, popt
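For context, a typical call, with the time and 4.24 amu columns hand-copied from the first data block below:

# time_sec and 4.24 amu columns from the He 3798 Q1520_ block below
t = np.array([2.691, 8.393, 13.995, 19.595, 25.195, 30.895,
              36.495, 42.095, 47.795, 53.395, 58.996, 64.596])
y = np.array([1.083636e-9, 1.081825e-9, 1.082597e-9, 1.081301e-9,
              1.080510e-9, 1.079783e-9, 1.079611e-9, 1.079237e-9,
              1.078437e-9, 1.078367e-9, 1.077280e-9, 1.077504e-9])

fitted_t, fitted_y, popt = double_ln_fit(t, y)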
Typically, one of the natural log terms is positive to represent ingrowth from memory and the other is negative to represent consumption from ionization.
This generally returns reasonable results:
Data for the example above:
He 3798 Q1520_ 2024-06-25T09:49:42
time_sec    2.22 amu      3.25 amu      4.24 amu      40.27 amu
2.691 1.918947E-9 1.353086E-9 1.083636E-9 2.424798E-11
8.393 1.924420E-9 1.352441E-9 1.081825E-9 2.451031E-11
13.995 1.924468E-9 1.350765E-9 1.082597E-9 2.493171E-11
19.595 1.925282E-9 1.349962E-9 1.081301E-9 2.494261E-11
25.195 1.927887E-9 1.349370E-9 1.080510E-9 2.491214E-11
30.895 1.930570E-9 1.348243E-9 1.079783E-9 2.517592E-11
36.495 1.932480E-9 1.347466E-9 1.079611E-9 2.510602E-11
42.095 1.931847E-9 1.348131E-9 1.079237E-9 2.513002E-11
47.795 1.934760E-9 1.346782E-9 1.078437E-9 2.512803E-11
53.395 1.929853E-9 1.345735E-9 1.078367E-9 2.498039E-11
58.996 1.934834E-9 1.345541E-9 1.077280E-9 2.520110E-11
64.596 1.932654E-9 1.344552E-9 1.077504E-9 2.550219E-11
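(Side note, in case anyone wants to reproduce this: each block is plain whitespace-delimited text with two header lines, so it loads with something like the following; the filename is hypothetical.)

import numpy as np

# "he3798.txt" is a hypothetical file holding the block above
raw = np.loadtxt("he3798.txt", skiprows=2)  # skip the run label and column header
t = raw[:, 0]   # time_sec
y = raw[:, 3]   # the 4.24 amu intensity column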
However, in other cases the fit lines become jagged and no longer resemble the fitted function at all, as in the 4 and 40 amu examples below:
Data for the above example:
He 3800 CB1 2024-06-25T10:19:42
time_sec    2.22 amu      3.25 amu      4.24 amu      40.27 amu
1.616 1.890576E-9 1.359579E-9 7.204719E-13 1.014118E-11
7.218 1.894592E-9 1.357168E-9 7.196407E-13 1.130713E-11
12.819 1.894756E-9 1.355699E-9 7.183293E-13 1.139818E-11
18.519 1.895724E-9 1.354677E-9 7.262793E-13 1.098621E-11
24.119 1.899128E-9 1.354235E-9 7.101848E-13 1.077834E-11
29.729 1.898467E-9 1.353402E-9 7.125748E-13 1.187576E-11
35.430 1.899100E-9 1.351661E-9 7.355859E-13 1.104034E-11
41.032 1.899836E-9 1.352259E-9 7.077889E-13 1.089571E-11
46.633 1.902685E-9 1.351637E-9 7.197934E-13 1.141361E-11
52.235 1.902119E-9 1.351389E-9 7.155005E-13 1.151000E-11
57.835 1.902982E-9 1.350957E-9 7.118936E-13 1.155836E-11
63.535 1.907762E-9 1.351266E-9 7.195685E-13 1.208988E-11
The data above contain a high degree of scatter, and these types of "solutions" appear most often for high-scatter measurements; however, the function fits many other high-scatter cases without resorting to this kind of "solution".
On occasion, I see a warning in the terminal:
OptimizeWarning: Covariance of the parameters could not be estimated
but not often enough to explain every instance of the 'crazy' fits.
For some reason, the 4 amu data is the most sensitive, and more prone to being mis-fit than any other data set.
How is it possible that curve_fit is returning parameters that plot like this, and how can I force it to stick to 'real' solutions? Or perhaps there's a better function than curve_fit in this instance?
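(For what it's worth, curve_fit does take an initial guess and box bounds, which is one way to pin it to a physically plausible region; a sketch, with placeholder numbers rather than tuned values:)

# Sketch only: steer curve_fit with a starting point and box bounds.
# All numbers are illustrative placeholders, in the order a, b, p, c, q, d.
p0 = [1e-9, 1e-10, 1.0, -1e-10, 1.0, 0.0]
bounds = ([-1e-8, -1e-8, 1e-6, -1e-8, 1e-6, -1e-8],
          [ 1e-8,  1e-8, 1e6,   1e-8, 1e6,   1e-8])
popt, pcov = curve_fit(double_ln_func, t, y, p0=p0, bounds=bounds)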
Bah, of course I figured it out right after I wrote this post.
It was a scaling issue, due to the raw intensity numbers (of order 1e-9 down to 1e-13 here) being so low; multiplying them up to order unity before fitting resolves it.
This produces perfectly reasonable fits for most data:
def double_ln_func(t, a, b, p, c, q):
    return a + b * np.log(p * t) - c * np.log(q * t)

def double_ln_fit(t, y):
    y = y * 1e10  # rescale the tiny raw intensities to order unity
    popt, pcov = curve_fit(double_ln_func, t, y)  # , p0=initial_guess
    fitted_t = np.linspace(min(t), max(t), 100)
    fitted_y = double_ln_func(fitted_t, *popt) / 1e10  # undo the scaling
    return fitted_t, fitted_y, popt  # note: popt is still in the scaled units
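The troublesome 4.24 amu column from the second block can then be pushed through the rescaled version the same way (arrays hand-copied from above):

t = np.array([1.616, 7.218, 12.819, 18.519, 24.119, 29.729,
              35.430, 41.032, 46.633, 52.235, 57.835, 63.535])
y = np.array([7.204719e-13, 7.196407e-13, 7.183293e-13, 7.262793e-13,
              7.101848e-13, 7.125748e-13, 7.355859e-13, 7.077889e-13,
              7.197934e-13, 7.155005e-13, 7.118936e-13, 7.195685e-13])

fitted_t, fitted_y, popt = double_ln_fit(t, y)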
It refuses to fit some measurements entirely, but that's a topic for a different post.