pythonscipycurve-fittinggaussianbeta-distribution

How can I model the curve of asymmetric peaks using scipy.stats.beta?


I am trying to integrate chromatographic peaks. I need to be able to model the peak shape to consistently determine when the peaks begin and end. I can model Gaussian peaks but many peaks are asymmetrical, as in this example data, and would be better modelled with a Beta distribution. I can fit Gaussian models to the data but I can't seem to obtain Beta models which fit the peak. What am I doing wrong here? I have read many SO posts with similar questions but none of them demonstrate how to plot the derived model over the raw data to show the fit. Eg:

Scipy - How to fit this beta distribution using Python Scipy Curve Fit

Struggling to graph a Beta Distribution using Python

How to properly plot the pdf of a beta function in scipy.stats

Here's my code:

from scipy.stats import beta
import plotly.graph_objects as go
import peakutils

# raw data 
yy = [128459, 1822448, 10216680, 24042041, 30715114, 29537797, 25022446, 18416199, 14138783, 12116635, 9596337, 7201602, 5668133, 4671416, 3920953, 3259980, 2756295, 2326780, 2095209, 1858894, 1646824, 1375129, 1300799, 1253879, 1086045, 968363, 932041, 793707, 741462, 741593]
xx = list(range(0,len(yy)))
fig = go.Figure()
fig.add_trace(go.Scatter(x = xx, y = yy, name = 'raw', mode = 'lines'))

# Gaussian model
gamp, gcen, gwid = peakutils.peak.gaussian_fit(xx, yy, center_only= False) # this calculates the Amplitude, Center & Width of a guassian peak
print(gamp, gcen, gwid)
gauss_model = [peakutils.peak.gaussian(x, gamp, gcen, gwid) for x in xx]
fig.add_trace(go.Scatter(x = xx, y = gauss_model, mode = 'lines', name = 'Gaussian'))

# Beta distribution model
aa, bb, loc, scale = beta.fit(dat)
print('beta fit:', beta_fit)
beta_model = beta.pdf(xx, aa, bb, loc, scale)
fig.add_trace(go.Scatter(x = xx, y = beta_model, mode = 'lines', name = 'Beta'))

fig.show()

enter image description here


Solution

  • I'm assuming dat = np.repeat(xx, yy): xx are the values of the observations, and yy are the frequencies.

    I'll also assume you must use beta to fit discrete data, as specified (it doesn't look like the best choice of distribution, though), and that you want to use maximum likelihood optimization to perform the fit (since that's what your code tries to do right now).

    Your code might work as-is, but you are plotting the PDF, which integrates to 1, against a histogram, which integrates to np.sum(yy). They are going to have very different areas, so they are not comparable. You need to either multiply the PDF by np.sum(yy) or normalize the histogram. I'll do the latter since I don't have the libraries you are using.

    Also, I'll provide reasonable guesses for the loc and scale based on the fact that there were zero observations at np.min(xx)-1 and zero observations at np.max(xx)+1).

    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt
    
    yy = [128459, 1822448, 10216680, 24042041, 30715114, 29537797, 25022446, 18416199, 14138783, 12116635, 9596337, 7201602, 5668133, 4671416, 3920953, 3259980, 2756295, 2326780, 2095209, 1858894, 1646824, 1375129, 1300799, 1253879, 1086045, 968363, 932041, 793707, 741462, 741593]
    xx = list(range(0,len(yy)))
    dat = np.repeat(xx, yy)  # asssume this is how `dat` is defined
    
    # estimate `loc` and `scale` manually
    loc = np.min(xx) - 1 
    scale = (np.max(xx) + 1) - loc
    
    # fit a subset of the data to speed up optimization, 
    # guessing `loc` and `scale`
    a, b, loc, scale = stats.beta.fit(dat[::1000], loc=loc, scale=scale)
    
    # make bins to contain the observations nicely
    bins = np.arange(np.min(xx)-0.5, np.max(xx)+0.5)
    
    # plot _normalized_ histogram
    plt.hist(dat, bins=bins, density=True)
    
    # plot pdf
    pdf = stats.beta(a, b, loc, scale).pdf(xx)
    plt.plot(xx, pdf)
    

    enter image description here

    If you have trouble with optimization not converging for the full data set, you can fix loc and scale.

    # _fix_ loc and scale this time (floc=loc, fscale=scale)
    a, b, loc, scale = stats.beta.fit(dat, floc=loc, fscale=scale)
    plt.hist(dat, bins=np.arange(np.min(xx)-0.5, np.max(xx)+0.5), density=True)
    pdf = stats.beta(a, b, loc, scale).pdf(xx)
    plt.plot(xx, pdf)
    

    enter image description here

    If you wanted to plot with the original scale rather than the normalized PDF:

    plt.plot(xx, yy, label='data')
    plt.plot(xx, pdf*np.sum(yy), label='beta fit')
    plt.legend()
    

    enter image description here