statistics data-analysis mean overlapping stdev

How to get the means and standard deviations of two overlapping normal distributions?


I have the following plot:

[Figure: Overlapping normal distributions]

I would like to estimate the means and standard deviations of the two apparently overlapping normal distributions. This is slightly complicated by the fact that the data is based on hour of the day and is therefore circular: the right-hand tail(s) wrap around into the left end.

How do I handle this?


Solution

  • I'd like to thank Robert Dodier and Adrian Keister for the starting point, and Emily Grace Ripka for her GitHub project (the Peak fitting Jupyter notebook).

    I was able to approximate the two overlapping distributions with von Mises distributions, and then fit the mean and kappa of each by minimizing the error between the model and the observed counts. Kappa is the concentration parameter; it plays a role analogous to the inverse variance, 1/sigma^2, of a normal distribution, so a larger kappa means a narrower peak.

    I was able to accomplish this with two pieces of SciPy: the scipy.stats.vonmises distribution and the scipy.optimize.curve_fit function.

    I created the following two helper functions:

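    from scipy.stats import vonmises

    # Sum of two scaled von Mises densities; x and the centres are in radians.
    def two_von_mises(x, amp1, cen1, kappa1, amp2, cen2, kappa2):
        return (amp1 * vonmises.pdf(x - cen1, kappa1)
                + amp2 * vonmises.pdf(x - cen2, kappa2))

    # A single scaled von Mises density.
    def one_von_mises(x, amp, cen, kappa):
        return amp * vonmises.pdf(x - cen, kappa)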
    

    I needed to map the hour of day onto an angle in the interval -pi <= angle < pi, like so:

    import numpy as np
    hourly_df['Angle'] = (2 * np.pi * hourly_df['HourOfDay'] / 24) - np.pi
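
Because the fitted centres come back as angles, it helps to have the inverse mapping on hand as well. This is a minimal sketch, assuming the same 24-hour convention as the line above; the function names `hour_to_angle` and `angle_to_hour` are illustrative and not part of the original code.

```python
import math

TWO_PI = 2 * math.pi

def hour_to_angle(hour):
    """Map an hour of day in [0, 24) to an angle in [-pi, pi)."""
    return TWO_PI * hour / 24 - math.pi

def angle_to_hour(angle):
    """Map any angle in radians back to an hour of day in [0, 24).

    The modulo folds angles that lie outside [-pi, pi) back onto the
    circle, so angle and angle + 2*pi give the same hour.
    """
    return ((angle + math.pi) % TWO_PI) * 24 / TWO_PI

print(hour_to_angle(0))                    # midnight maps to -pi
print(hour_to_angle(12))                   # noon maps to (approximately) 0
print(angle_to_hour(hour_to_angle(18.5)))  # round trip, approximately 18.5
```

The modulo in `angle_to_hour` matters for the fit results: an optimizer is free to return a centre shifted by any multiple of 2*pi, since the von Mises density is periodic.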
    

    I was then able to use the curve_fit function of the scipy.optimize module like so:

    from scipy.optimize import curve_fit
    popt, pcov = curve_fit(two_von_mises, hourly_df['Angle'], hourly_df['Count'], p0=[1, 11, 1, 1, 18, 1])
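
Two optional refinements may be worth trying here. The covariance matrix `pcov` can be turned into approximate one-sigma uncertainties with `np.sqrt(np.diag(pcov))`, and the `bounds` argument of `curve_fit` can keep each kappa non-negative and each centre inside [-pi, pi]. The sketch below illustrates both on synthetic data. The "true" parameters, noise level, and starting guess are made up for illustration and are not the values from the data set above.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import vonmises

def two_von_mises(x, amp1, cen1, kappa1, amp2, cen2, kappa2):
    return (amp1 * vonmises.pdf(x - cen1, kappa1)
            + amp2 * vonmises.pdf(x - cen2, kappa2))

# Synthetic hourly data: 24 angles in [-pi, pi) and noisy "counts".
rng = np.random.default_rng(0)
angles = 2 * np.pi * np.arange(24) / 24 - np.pi
true_params = [1000.0, -0.5, 1.3, 600.0, 1.5, 2.0]  # made-up values
counts = two_von_mises(angles, *true_params) + rng.normal(0.0, 0.5, angles.size)

# Bounds: amplitudes and kappas >= 0, centres restricted to [-pi, pi].
lower = [0.0, -np.pi, 0.0, 0.0, -np.pi, 0.0]
upper = [np.inf, np.pi, np.inf, np.inf, np.pi, np.inf]
p0 = [800.0, -0.3, 1.0, 500.0, 1.3, 1.5]  # starting guess; centres in radians

popt, pcov = curve_fit(two_von_mises, angles, counts, p0=p0, bounds=(lower, upper))
perr = np.sqrt(np.diag(pcov))  # approximate standard error of each parameter

names = ["amp1", "cen1", "kappa1", "amp2", "cen2", "kappa2"]
for name, value, err in zip(names, popt, perr):
    print(f"{name}: {value:.3f} +/- {err:.3f}")
```

Note that the starting guess here gives the centres in radians, matching the units of the angle column. With real data the fit can be sensitive to the starting guess, so it is worth trying a few.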
    

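    From this I got estimates of all the parameters for the two distributions. The popt variable above holds them in the order amp1, cen1, kappa1, amp2, cen2, kappa2, with the centres expressed as angles in radians: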

    array([1.66877995e+04, 2.03310292e+01, 2.03941267e+00, 3.61717300e+04,
           2.46426705e+01, 1.32666704e+00])
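
Since the question asked for means and standard deviations, the fitted values can be translated back into hours. Two facts are used: the centres are angles that are only meaningful modulo 2*pi, and for a von Mises distribution the circular standard deviation is sqrt(-2 ln(I1(kappa) / I0(kappa))), where I0 and I1 are modified Bessel functions of the first kind. This is a sketch under those assumptions; the helper names are illustrative, and the input numbers are simply the `popt` values printed above.

```python
import numpy as np
from scipy.special import i0e, i1e

popt = np.array([1.66877995e+04, 2.03310292e+01, 2.03941267e+00,
                 3.61717300e+04, 2.46426705e+01, 1.32666704e+00])

HOURS_PER_RADIAN = 24 / (2 * np.pi)

def centre_to_hour(cen):
    """Fold an angle (radians) onto the circle and convert to hour of day."""
    return ((cen + np.pi) % (2 * np.pi)) * HOURS_PER_RADIAN

def circular_std_hours(kappa):
    """Circular standard deviation of a von Mises distribution, in hours."""
    # Mean resultant length R = I1(kappa) / I0(kappa). The exponentially
    # scaled Bessel functions avoid overflow and give the same ratio.
    r = i1e(kappa) / i0e(kappa)
    return np.sqrt(-2 * np.log(r)) * HOURS_PER_RADIAN

labels = ("component 1", "component 2")
for label, (amp, cen, kappa) in zip(labels, popt.reshape(2, 3)):
    print(f"{label}: mean hour = {centre_to_hour(cen):.2f}, "
          f"circular std = {circular_std_hours(kappa):.2f} h")
```

With the values above, the two centres fold back to roughly 17.7 h and 10.1 h. For large kappa the circular standard deviation approaches 1/sqrt(kappa) radians, which is the sense in which kappa behaves like an inverse variance.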
    

    Plotting this, we see:

    [Figure: Data with superimposed von Mises pdf graphed]

    The next step will be to see whether we can determine which distribution a query belongs to, based on categorical data collected for each query, but that is another story...

    Thanks!