python, pandas, scipy, bootstrapping, errorbar

Get error bars for the bins of a dataset using bootstrapping


I have some data in a column called variable that I can simulate in the following way:

import pandas as pd
import numpy as np
from scipy.stats import bootstrap

random_values = np.random.uniform(low=0, high=10, size=10000)
df = pd.DataFrame({"variable": random_values})

I want to bin my data into 5 bins with the edges bins = [0, 2, 4, 6, 8, 10] and calculate error bars for each bin with some bootstrapping method, e.g. the confidence interval at the 95% level. The cumbersome part, as far as I can tell, is calculating the error bars. I could do it with scipy.stats.bootstrap and then call

bootstrap((one_of_the_bins,), my_statistic, confidence_level=0.95, method='percentile')

but that requires splitting my data into chunks according to the bins and looping over the chunks. So I wonder: is there a more convenient way to do this? Is there some functionality integrated into pandas for that? Or can I give scipy.stats my full data together with the bins, so that scipy does the calculations for all the bins at once? Thank you for any advice!
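
For reference, a minimal sketch of the chunk-and-loop approach I would like to avoid (the bin assignment via pd.cut is just one way to do the splitting):

import pandas as pd
import numpy as np
from scipy.stats import bootstrap

random_values = np.random.uniform(low=0, high=10, size=10000)
df = pd.DataFrame({"variable": random_values})
bins = [0, 2, 4, 6, 8, 10]

# assign each value to a bin, then bootstrap every chunk separately
df["bin"] = pd.cut(df["variable"], bins=bins)
for label, chunk in df.groupby("bin", observed=True):
    res = bootstrap((chunk["variable"].to_numpy(),), np.mean,
                    confidence_level=0.95, method='percentile')
    print(label, res.confidence_interval)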


Solution

  • scipy.stats.bootstrap can do the calculation for all the bins together if you make the binning part of the statistic:

    import numpy as np
    from scipy.stats import bootstrap, binned_statistic
    
    x = np.random.uniform(low=0, high=10, size=10000)
    
    def statistic(x):
        # bin the (resampled) data and return the mean within each bin
        res = binned_statistic(x, x, statistic='mean',
                               bins=[0, 2, 4, 6, 8, 10])
        return res.statistic
    
    # bootstrap resamples x and recomputes the binned means for each resample;
    # the defaults are confidence_level=0.95 and method='BCa'
    res = bootstrap((x,), statistic)
    res.confidence_interval
    # ConfidenceInterval(low=array([0.99653325, 2.969743  , 4.99033544, 6.98312963, 8.9515727 ]), 
    #                    high=array([1.04843922, 3.02077679, 5.04092083, 7.03426957, 9.0015762 ]))
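
    If the goal is a plot, the interval converts directly into error bars. A minimal sketch, assuming matplotlib and using the statistic of the observed data as the point estimate (the bin midpoints are my own choice of x-positions):

    import matplotlib.pyplot as plt

    bin_edges = np.array([0, 2, 4, 6, 8, 10])
    centers = (bin_edges[:-1] + bin_edges[1:]) / 2  # bin midpoints as x-positions
    means = statistic(x)                            # binned means of the observed data
    ci = res.confidence_interval
    # errorbar expects offsets from the point estimate, not absolute bounds
    yerr = np.vstack([means - ci.low, ci.high - means])
    plt.errorbar(centers, means, yerr=yerr, fmt='o', capsize=4)
    plt.show()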
    

    Note the theoretical difference between pre-binning the data and bootstrapping each bin separately, and making the binning part of the bootstrapped statistic as above.

    You can visualize the difference more easily if the statistic parameter of binned_statistic were 'count'. Suppose your observed data is binned with the following counts per bin:

    binned_statistic(x, x, statistic='count', bins=[0, 2, 4, 6, 8, 10])
    # BinnedStatisticResult(statistic=array([1917., 2013., 1998., 2005., 2067.]) ... 
    

    If you were to pre-bin the data, there would be no uncertainty in the counts in each bin: a bootstrap resample of a chunk has, by construction, the same size as the chunk. On the other hand, if you include the binning as part of the statistic (i.e. with statistic='count' in the function above), you might get something like:

    res.confidence_interval
    # ConfidenceInterval(low=array([1841., 1936., 1921., 1929., 1990.]), 
    #                    high=array([1993., 2091., 2078., 2084.05, 2147.]))
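
    For completeness, a sketch of the call behind the interval above (the name count_statistic is mine; it mirrors the mean version):

    def count_statistic(x):
        # count how many resampled points land in each bin
        res = binned_statistic(x, x, statistic='count',
                               bins=[0, 2, 4, 6, 8, 10])
        return res.statistic

    res = bootstrap((x,), count_statistic)
    res.confidence_interval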