Tags: python, logarithm, data-fitting, model-fitting

Fitting a function to exponentially decreasing numbers: Ensuring equal weight for each data point


So this question is based on a biochemical experiment; for those who know a bit about biochemistry, it is an enzyme kinetics experiment. I have a dilution series of an activator (a, my x-values) and am measuring the enzyme velocity (y). The fitting equation (mmat) is derived from a biological model.

The challenge I'm facing is that in this model the velocity should never reach zero, let alone negative values, yet my fits do. The crucial information is contained in the data points at very low concentrations. When I fit my data with the equation, these low-concentration points are effectively ignored by the least-squares fit because their absolute values are small. However, I need to extract the information (alpha) that is hidden in exactly these low-concentration points.

I believe I should use a logarithmic scale for x to ensure that the least-squares function doesn't overlook those data points. However, I'm struggling to implement this and could use some guidance, especially since I'm not that well versed in math or Python.

In my code example I implemented:

  1. Perfect test data, to show how the fit should look. (orange in plot)
  2. Real data, which can be fitted, but the fit reaches negative velocities, which is physically impossible. Sometimes it is also close to impossible to fit at all. (blue in plot)
  3. A fit using my obviously naive idea of putting the logarithmic scaling inside the fitting function. This is basically the effect I would like to achieve, but clearly not the way I tried it. (green in plot)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# My functions based on Michaelis-Menten implementing activators
def mmat(a, v_max, K_D, alpha):
    S=2.5e6
    K_S=1e6
    return (v_max * (1 + (a/(alpha*K_D))) * S) / ((K_S/alpha)*(1 + (a/K_D))+S*(1 + (a/(alpha*K_D))))
def log_mmat(a, v_max, K_D, alpha):
    S=2.5e6
    K_S=1e6
    return (v_max * (1 + (np.log10(a)/(alpha*K_D))) * S) / ((K_S/alpha)*(1 + (np.log10(a)/K_D))+S*(1 + (np.log10(a)/(alpha*K_D))))

# Filling my dataframes with optimal data (test_data) and measured data (real_data)
lp = np.logspace(-0.5, 4, num=1000)

test_data_x = np.logspace(-0.5, 4, num=12)
test_data_y = np.array([28,35,51,90,173,314,486,625,704,741,757,763])
test_data = pd.DataFrame({'x': test_data_x, 'y': test_data_y})

real_data_y = np.array([ 12.87397621,  12.64915001,  14.22025688,  22.62179769,  41.76414236,
  62.49097641, 179.72713147, 309.08516559, 497.11213079, 449.61759694,
 469.24974154, 360.40709778,  13.67425041,  10.42765308,  23.52110248,
  33.85240147,  47.72738955, 132.63407297, 213.80971215, 290.36035529,
 371.0033705,  414.93547975, 426.62376543, 432.21230229])
real_data_x = np.array([1.00242887e+00, 2.25546495e+00, 5.07479613e+00, 1.14182913e+01,
 2.56911554e+01, 5.78050997e+01, 1.30061474e+02, 2.92638317e+02,
 6.58436214e+02, 1.48148148e+03, 3.33333333e+03, 7.50000000e+03,
 1.00242887e+00, 2.25546495e+00, 5.07479613e+00, 1.14182913e+01,
 2.56911554e+01, 5.78050997e+01, 1.30061474e+02, 2.92638317e+02,
 6.58436214e+02, 1.48148148e+03, 3.33333333e+03, 7.50000000e+03,])
real_data = pd.DataFrame({'x': real_data_x, 'y': real_data_y})

# Function fitting
p0 = [max(test_data['y']), 200, 0.0001]
popt_test, pcov_test = curve_fit(mmat, test_data['x'], test_data['y'], p0, maxfev=10000)

p0 = [max(real_data['y']), 200, 0.0001]
popt_real, pcov_real = curve_fit(mmat, real_data['x'], real_data['y'], p0, maxfev=10000)
popt_log, pcov_log = curve_fit(log_mmat, real_data['x'], real_data['y'], p0, maxfev=100000)

vm_test=popt_test[0]
kd_test=popt_test[1]
al_test=popt_test[2]
vm_real=popt_real[0]
kd_real=popt_real[1]
al_real=popt_real[2]
vm_log=popt_log[0]
kd_log=popt_log[1]
al_log=popt_log[2]

# Plotting
fig = plt.figure(figsize=(25, 10))

ax1 = plt.subplot2grid((2, 5), (0, 0))
ax2 = plt.subplot2grid((2, 5), (0, 1))
ax3 = plt.subplot2grid((2, 5), (1, 0))
ax4 = plt.subplot2grid((2, 5), (1, 1))
ax5 = plt.subplot2grid((2, 5), (1, 2))

ax1.set_title('perfect data - normal x-scale')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.scatter(test_data['x'], test_data['y'], color='black')
ax1.plot(lp, mmat(lp, vm_test, kd_test, al_test), color='orange')

ax2.set_title('perfect data - log x-scale')
ax2.set_xlabel('x')
ax2.scatter(np.log10(test_data['x']), test_data['y'], color='black')
ax2.plot(np.log10(lp), mmat(lp, vm_test, kd_test, al_test), color='orange')

ax3.set_title('real data - normal x-scale')
ax3.set_xlabel('x')
ax3.set_ylabel('y')
ax3.scatter(real_data['x'], real_data['y'], color='black')
ax3.plot(lp, mmat(lp, vm_real, kd_real, al_real), color='blue')

ax4.set_title('real data - log x-scale')
ax4.set_xlabel('x')
ax4.scatter(np.log10(real_data['x']), real_data['y'], color='black')
ax4.plot(np.log10(lp), mmat(lp, vm_real, kd_real, al_real), color='blue')

ax5.set_title('real data - log(x) for fitting')
ax5.set_xlabel('x')
ax5.scatter(np.log10(real_data['x']), real_data['y'], color='black')
ax5.plot(np.log10(lp), log_mmat(lp, vm_log, kd_log, al_log), color='lightgreen')

plt.tight_layout()
plt.show()

Could someone please help me modify this code to fit the equation using x on a logarithmic scale or at least tell me why this wouldn't work? Any tips or suggestions would be greatly appreciated!

Thanks in advance!


Solution

  • Okay, it seems I solved my problem. In case someone is searching for something similar at some point, here is the answer:

    Scipy's curve_fit function, which uses least-squares fitting, aims to minimize the sum of the squared differences between the calculated and observed y-data. The equation for this is:

    sum(((f(xdata, *popt) - ydata) / sigma)**2)

    Initially, I mistakenly believed that the x-values directly influence the fitting equation. While the x-values do play a role, it is less direct: in my data the standard deviation (σ) is significantly higher at higher concentrations. Using those values as-is degrades the fit, because every point is weighted equally even though the high-concentration points are less accurate in absolute terms. This equal weighting happens because curve_fit defaults σ to one for every point if it is not specified.
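    To make the role of σ concrete, here is a minimal sketch with a toy linear model (not the enzyme model above), showing how passing sigma to curve_fit down-weights a point with large scatter:

    ```python
    import numpy as np
    from scipy.optimize import curve_fit

    # Toy linear model: y = m * x
    def f(x, m):
        return m * x

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([1.1, 1.9, 3.2, 8.0])      # last point is an outlier...
    sigma = np.array([0.1, 0.1, 0.1, 4.0])  # ...but it has a large sigma

    # Unweighted fit: every residual counts equally,
    # so the outlier drags the slope up
    popt_unw, _ = curve_fit(f, x, y)

    # Weighted fit: each point contributes (residual / sigma)**2,
    # so the noisy point's influence is strongly reduced
    popt_w, _ = curve_fit(f, x, y, sigma=sigma)

    print(popt_unw[0])  # pulled toward the outlier
    print(popt_w[0])    # close to the slope of ~1 suggested by the clean points
    ```

    The same mechanism is what rescues the low-concentration points in my data: once their small σ is supplied, their small residuals are no longer drowned out by the large absolute residuals at high concentrations.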

    However, if your dataset includes replicates, you can calculate σ and use it to weight the data points according to it. This ensures that each data point is appropriately considered in the fitting process.

    To show how this looks, here is the code:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from scipy.optimize import curve_fit
    
    # My functions based on Michaelis-Menten implementing activators
    def mmat(a, v_max, K_D, alpha):
        S=2.5e6
        K_S=1e6
        return (v_max * (1 + (a/(alpha*K_D))) * S) / ((K_S/alpha)*(1 + (a/K_D))+S*(1 + (a/(alpha*K_D))))
    
    # Filling my dataframes with measured data (real_data)
    lp = np.logspace(-0.5, 4, num=1000)
    df = pd.DataFrame({
        'x': [1.00242887e+00, 2.25546495e+00, 5.07479613e+00, 1.14182913e+01,
     2.56911554e+01, 5.78050997e+01, 1.30061474e+02, 2.92638317e+02,
     6.58436214e+02, 1.48148148e+03, 3.33333333e+03, 7.50000000e+03],
        'y1': [ 12.87397621,  12.64915001,  14.22025688,  22.62179769,  41.76414236,
      62.49097641, 179.72713147, 309.08516559, 497.11213079, 449.61759694,
     469.24974154, 360.40709778],
        'y2': [13.67425041,  10.42765308,  23.52110248,
      33.85240147,  47.72738955, 132.63407297, 213.80971215, 290.36035529,
     371.0033705,  414.93547975, 426.62376543, 432.21230229]})
    
    # calculating standard deviation
    df['sigma'] = df.iloc[:,[1,2]].std(axis=1)
    dfm = df.melt(id_vars=['x','sigma'],value_vars=['y1','y2'])
    dfm.columns=['x', 'sigma', 'measurement', 'y']
    
    # Function fitting
    p0 = [max(dfm['y']), 200, 0.0001]
    popt_real, pcov_real = curve_fit(mmat, dfm['x'], dfm['y'],p0, sigma=dfm['sigma'] ,maxfev=10000)
    perr_real = np.sqrt(np.diag(pcov_real))
    
    vm_real=popt_real[0]
    kd_real=popt_real[1]
    al_real=popt_real[2]
    
    # Plotting
    fig = plt.figure(figsize=(20, 7.5))
    
    ax1 = plt.subplot2grid((2, 5), (0, 0))
    ax2 = plt.subplot2grid((2, 5), (0, 1))
    
    ax1.set_title('real data - normal x-scale')
    ax1.set_xlabel('x')
    ax1.set_ylabel('y')
    ax1.scatter(dfm['x'], dfm['y'], color='black')
    ax1.plot(lp, mmat(lp, vm_real, kd_real, al_real), color='blue')
    
    ax2.set_title('real data - log x-scale')
    ax2.set_xlabel('x')
    ax2.scatter(np.log10(dfm['x']), dfm['y'], color='black')
    ax2.plot(np.log10(lp), mmat(lp, vm_real, kd_real, al_real), color='blue')
    
    plt.tight_layout()
    plt.show()
    
    print(f'vmax: {popt_real[0]} \u00B1 {perr_real[0]}')
    print(f'K_D: {popt_real[1]} \u00B1 {perr_real[1]}')
    print(f'alpha: {popt_real[2]} \u00B1 {perr_real[2]}')
    

    Maybe this helps some other lost soul...
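    P.S., one follow-up detail worth knowing (this comes from SciPy's curve_fit documentation, not from my original solution): by default, sigma is treated as a set of relative weights only, and pcov is rescaled to match the actual scatter of the residuals. If your σ values are genuine standard deviations in the units of y, passing absolute_sigma=True makes the parameter errors from pcov reflect those σ directly. A minimal sketch with toy linear data:

    ```python
    import numpy as np
    from scipy.optimize import curve_fit

    def f(x, m):
        return m * x

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([1.0, 2.1, 2.9, 4.2])
    sigma = np.array([0.1, 0.1, 0.2, 0.2])

    # Default: sigma acts as relative weights; pcov is rescaled
    # by the reduced chi-square of the fit
    popt_rel, pcov_rel = curve_fit(f, x, y, sigma=sigma)

    # absolute_sigma=True: sigma is taken as true standard deviations
    # and pcov is reported without that rescaling
    popt_abs, pcov_abs = curve_fit(f, x, y, sigma=sigma, absolute_sigma=True)

    # The best-fit parameters are identical; only the error estimates differ
    print(np.sqrt(np.diag(pcov_rel)))
    print(np.sqrt(np.diag(pcov_abs)))
    ```

    For replicate-derived σ like in my answer above, absolute_sigma=True is arguably the more honest choice for the reported ± errors, though the fitted vmax, K_D and alpha themselves do not change.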