I need to run a chi-square test on my dataset to get the p-value. The obvious choice is to use chi2_contingency() and chi2.cdf() from scipy.stats. But the p-value = 5.723076338262742e-82 is so tiny that the computation takes 3 seconds even for this simple dataset. I want to avoid this slow step by setting a custom threshold for chi2.cdf(): if the p-value is much smaller than 0.01, I don't think it's worth the computational effort to calculate it exactly.
My example dataset is:
import numpy as np
from scipy.stats import chi2_contingency
# Observed data
observed = np.array([[150, 700], [350, 150]])
# Perform the chi-square test
chi2, p, dof, expected = chi2_contingency(observed)
# Print the results
print(f"Chi2 Statistic: {chi2}")
print(f"P-value: {p}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:")
print(expected)
# Results
Chi2 Statistic: 367.7704987889273
P-value: 5.723076338262742e-82
Degrees of Freedom: 1
Expected Frequencies:
[[314.81481481 535.18518519]
[185.18518519 314.81481481]]
I tried to bypass the computation, but even this approach compares the p-value with the threshold only after chi2_contingency has already computed it.
import numpy as np
from scipy.stats import chi2, chi2_contingency
# Observed data
observed = np.array([[150, 700], [350, 150]])
# Perform the chi-square test (this already computes the p-value)
chi2_stat, p_value, dof, expected = chi2_contingency(observed)
# Set your threshold (for example, 0.01)
threshold = 0.01
# Check if the p-value is below the threshold
if p_value < threshold:
    print(f'P-value is extremely small (<{threshold}). Skipping the exhaustive computation.')
else:
    # Compute the actual p-value
    p_value = 1 - chi2.cdf(chi2_stat, dof)
    print(f'P-value: {p_value}')
To wrap it up, I'm looking for a programmatic way to avoid calculating the p-value every time, computing it only when it is >= 0.01. Looking forward to your input!
You can implement the calculation of the statistic yourself to avoid having chi2_contingency perform the p-value calculation (a sketch is at the end of this answer), but I don't think it's worth your time, because chi2_contingency(observed) takes less than half a millisecond on Google Colab for your data.
%timeit chi2_contingency(observed)
# 415 µs ± 30.8 µs per loop (mean ± std.)
Calculating the p-value itself accounts for a fraction of that, and it will not depend noticeably on the values. The distribution infrastructure has a ton of overhead; the underlying special function call is only a few microseconds (and even that is mostly data-independent overhead).
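If you want to check that on your own machine, here is a rough timing sketch. I believe chi2.sf is ultimately a thin wrapper around scipy.special.chdtrc, but treat that call chain as an assumption; either way, the gap between the two timings shows how much of the cost is generic distribution overhead rather than the tail-probability computation itself.

# Rough timing sketch: full distribution-object path vs. a direct
# special-function call for the same tail probability.
import timeit
from scipy import special
from scipy.stats import chi2

chi2_stat, dof = 367.7704987889273, 1  # values from the example above
n = 10_000
t_dist = timeit.timeit(lambda: chi2.sf(chi2_stat, dof), number=n)
t_spec = timeit.timeit(lambda: special.chdtrc(dof, chi2_stat), number=n)
print(f"chi2.sf:        {t_dist / n * 1e6:.1f} µs per call")
print(f"special.chdtrc: {t_spec / n * 1e6:.1f} µs per call")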
I imagine the time you are observing comes from the imports or from the statistic calculation (if your real data is different from the example you've given here), but if chi2_contingency really is that slow on your machine, please submit a bug report to SciPy (https://github.com/scipy/scipy/issues).
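For completeness, here is what implementing the statistic yourself could look like: precompute the critical value for your threshold once with chi2.isf, and only convert the statistic to a p-value when it does not already exceed that critical value. This is a minimal sketch of a plain Pearson chi-square with no Yates continuity correction (which chi2_contingency applies by default to 2x2 tables), so the statistic differs slightly from your output above; treat it as an illustration, not a drop-in replacement.

# Minimal sketch: skip the p-value whenever the statistic already
# exceeds the critical value for the chosen threshold.
# Assumption: plain Pearson chi-square, no continuity correction.
import numpy as np
from scipy.stats import chi2

def pearson_chi2(observed):
    """Return (statistic, degrees of freedom) for a contingency table."""
    observed = np.asarray(observed, dtype=float)
    expected = (observed.sum(axis=1, keepdims=True)
                * observed.sum(axis=0, keepdims=True) / observed.sum())
    stat = ((observed - expected) ** 2 / expected).sum()
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return stat, dof

threshold = 0.01
observed = np.array([[150, 700], [350, 150]])
stat, dof = pearson_chi2(observed)
critical = chi2.isf(threshold, dof)  # computed once per (threshold, dof)

if stat > critical:
    print(f"p-value < {threshold}; skipping the exact computation.")
else:
    p_value = chi2.sf(stat, dof)  # sf avoids the round-off of 1 - cdf
    print(f"P-value: {p_value}")

As a side note, chi2.sf(x, dof) is preferable to 1 - chi2.cdf(x, dof) in any case: for a statistic this large, 1 - cdf underflows to 0.0, while sf returns the tiny tail probability directly.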