I have an array of shape (n, timesteps)
, where n
is the number of trials and timesteps
is the length of each trial. Each value of this array denotes a stochastic measurement.
I would like to implement a generic function that computes a confidence interval for a given statistic (mean, median, ...) assuming 1) the underlying distribution is normal, 2) is Student's.
Something like
def normal_ci(
data: np.array,
axis: int = 0,
statistic: Callable = np.mean,
confidence: float = 0.95
):
and a similar function student_ci()
.
My problem is that, by default, scipy functions compute intervals for the mean, am I right? Like in this answer.
This is on Stack Overflow, so the computational answer is to use the bootstrap.
from typing import Callable
import numpy as np
from scipy import stats
rng = np.random.default_rng(84354894298246)
def normal_ci(
data: np.array,
axis: int = 0,
statistic: Callable = np.mean,
confidence: float = 0.95
):
res = stats.bootstrap((data,), statistic, axis=axis,
confidence_level=confidence)
return tuple(res.confidence_interval)
# generate 1000 samples, each of length 100
data = rng.standard_normal(size=(1000, 100))
# compute the confidence interval for each
low, high = normal_ci(data, axis=-1)
# the confidence interval contains the population value
# of the statistic in 95% of cases
np.mean((low < 0) & (0 < high)) # 0.953
Since you know the families from which your data are drawn, you could look into the parametric bootstrap.
def normal_ci(
data: np.array,
axis: int = 0,
statistic: Callable = np.mean,
confidence: float = 0.95
):
# fit a normal distribution to the data
mean = np.mean(data, axis=axis)
std = np.std(data, ddof=1, axis=axis)
# resample data from the fitted normal distribution
n_resamples = 999
m = data.shape[axis] # assuming axis is an integer
resample = rng.normal(loc=mean, scale=std, size=(m, n_resamples) + mean.shape)
# compute the statistic for each of the resamples to estimate
# the distribution
statistic_distribution = statistic(resample, axis=0)
# Generate the confidence interval
# percentile bootstrap (very crude)
alpha = 1 - confidence
low, high = np.quantile(statistic_distribution, [alpha/2, 1-alpha/2], axis=0)
return low, high
low, high = normal_ci(data, axis=-1)
np.mean((low < 0) & (0 < high)) # 0.954
SciPy's bootstrap
function uses the BCa method by default, which is reasonably sophisticated; this parametric bootstrap is what I could write in a few minutes. So for parametric bootstrap, you really should research what other libraries are out there.
If you're not happy with the bootstrap, you'd need to do some research: for each of the statistics and population families you're interested in, you need to know the distribution of the statistic of a sample drawn from your population. For a sample from a normal distribution, the sample mean is t-distributed, the variance is chi-squared distributed, etc... and that is not a question for Stack Overflow.