I have some data in a column called variable that I can simulate in the following way:
import pandas as pd
import numpy as np
from scipy.stats import bootstrap
random_values = np.random.uniform(low=0, high=10, size=10000)
df = pd.DataFrame({"variable": random_values})
I want to bin my data into 5 bins, bins = [0, 2, 4, 6, 8, 10],
and calculate for each bin error bars with some bootstrapping method, e.g. a 95% confidence interval. I figured out that the cumbersome part is calculating the error bars. I could do it with scipy.stats.bootstrap
and then do
bootstrap((one_of_the_bins,), my_statistic, confidence_level=0.95, method='percentile')
but that requires splitting my data into chunks according to the bins and looping over the chunks. So I wonder: is there a more convenient way to do this? Is there some functionality integrated into pandas for that? Or can I pass my full data and the bins to scipy.stats so that scipy does the calculations for all the bins at once? Thank you for any advice!
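For reference, the per-bin loop I am trying to avoid would look roughly like this (a sketch; np.mean stands in for my_statistic):

bins = [0, 2, 4, 6, 8, 10]
df["bin"] = pd.cut(df["variable"], bins=bins)
for interval, chunk in df.groupby("bin", observed=True):
    # bootstrap each bin's values separately
    res = bootstrap((chunk["variable"].to_numpy(),), np.mean,
                    confidence_level=0.95, method='percentile')
    print(interval, res.confidence_interval)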
scipy.stats.bootstrap can do the calculation for all the bins together if you make the binning part of the statistic:
import numpy as np
from scipy.stats import bootstrap, binned_statistic
x = np.random.uniform(low=0, high=10, size=10000)
def statistic(x):
    # bin the sample and return the per-bin means, so the binning itself
    # is part of the bootstrapped statistic
    res = binned_statistic(x, x, statistic='mean',
                           bins=[0, 2, 4, 6, 8, 10])
    return res.statistic

# bootstrap re-bins every resample and returns one confidence interval per bin
res = bootstrap((x,), statistic)
res.confidence_interval
# ConfidenceInterval(low=array([0.99653325, 2.969743 , 4.99033544, 6.98312963, 8.9515727 ]),
# high=array([1.04843922, 3.02077679, 5.04092083, 7.03426957, 9.0015762 ]))
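If you want to match the percentile method from the question, you can pass the keywords explicitly (the defaults are confidence_level=0.95 and method='BCa'):

res = bootstrap((x,), statistic, confidence_level=0.95, method='percentile')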
Note the theoretical difference between pre-binning the data and bootstrapping each bin separately, and making the binning part of the bootstrapped statistic as above. You can visualize the difference more easily if the statistic parameter of binned_statistic is 'count'. Suppose your observed data is binned with the following counts per bin:
binned_statistic(x, x, statistic='count', bins=[0, 2, 4, 6, 8, 10])
# BinnedStatisticResult(statistic=array([1917., 2013., 1998., 2005., 2067.]) ...
If you were to pre-bin the data, then there would be no uncertainty in the counts in each bin. On the other hand, if you include the binning as part of the statistic, you might get something like:
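For example, bootstrapping a count-based statistic (a minimal sketch; count_statistic is an illustrative name, and res is reused for the new result):

def count_statistic(x):
    # same binning as before, but the statistic is the count per bin
    res = binned_statistic(x, x, statistic='count',
                           bins=[0, 2, 4, 6, 8, 10])
    return res.statistic

res = bootstrap((x,), count_statistic)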
res.confidence_interval
# ConfidenceInterval(low=array([1841., 1936., 1921., 1929., 1990.]),
# high=array([1993., 2091., 2078., 2084.05, 2147.]))