python, scipy, statistics

P-Value not matching test statistic in Kolmogorov–Smirnov test (stats.kstest)


I am trying to learn how to use the Kolmogorov–Smirnov test in scipy (stats.kstest). For a simple problem like the example in the documentation, the test statistic for the one sample test seems to be correct, but the p-value does not make sense.

Here is an example where I run the test 5000 times, each time against a freshly generated sample of normally distributed data.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

pValues = []
statistics = []
for q in range(5000):
    # draw a fresh sample of 5000 standard normal values
    F = stats.norm.rvs(size=5000)
    # one-sample KS test of the sample against the standard normal CDF
    res = stats.kstest(F, stats.norm.cdf, alternative='two-sided')
    pValues.append(res.pvalue)
    statistics.append(res.statistic)

plt.close('all')
statistics = np.array(statistics)
statCritical = 1.36/(len(F)**0.5)
print(f'fraction above critical stat ({statCritical}): ',len(statistics[statistics>statCritical])/len(statistics))
print(f'fraction below critical stat ({statCritical}): ',len(statistics[statistics<statCritical])/len(statistics))
plt.hist(statistics,bins=50)
plt.tight_layout()
plt.savefig('hist.statistic.png')

plt.close('all')
pValues = np.array(pValues)
print('fraction above pvalue 0.95: ',len(pValues[pValues>0.95])/len(pValues))
print('fraction below pvalue 0.05: ',len(pValues[pValues<0.05])/len(pValues))
plt.hist(pValues,bins=50)
plt.tight_layout()
plt.savefig('hist.pvalue.png')

I got the value for the critical test statistic from here: https://people.cs.pitt.edu/~lipschultz/cs1538/prob-table_KS.pdf

When I run this I get that the fraction of test statistics above the critical value is ~0.05 and the fraction below the critical value is ~0.95 (which is what I expected). The p-values that are returned, however, look essentially uniform and random between 0 and 1. What am I missing?


Solution

  • Your procedure is correct, and it produces exactly what you should expect.

    With the following seed:

    np.random.seed(12345)
    

    As you noticed, at a given confidence level some runs will reject H0 (you chose a 95% confidence level, so the critical probability is 0.05, which in your calculation corresponds to D > 0.019233). That is, some of the drawn samples land far enough from the normal CDF that, at that confidence level, we can no longer trust that they were actually drawn from a normal distribution.
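
    If it helps, the same threshold can be recovered from scipy itself rather than from the printed table. This is just an illustrative sketch (not part of the original post); it assumes a recent scipy that exposes the asymptotic stats.kstwobign and the exact stats.kstwo distributions:

        import numpy as np
        from scipy import stats

        n = 5000

        # asymptotic critical value c(alpha)/sqrt(n); c(0.05) is about 1.358,
        # matching the 1.36/sqrt(n) rule read from the printed table
        crit_asymp = stats.kstwobign.ppf(0.95) / np.sqrt(n)

        # exact finite-sample critical value for n observations
        # (stats.kstwo is available in scipy >= 1.4)
        crit_exact = stats.kstwo.ppf(0.95, n)

        print(crit_asymp, crit_exact)  # both close to 0.0192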

    [figure: histogram of the D statistics across the 5000 runs]

    Runs with a D statistic greater than your critical value lead to rejecting H0.

    Now if we check how the p-values are distributed, we confirm your observation that they are essentially uniformly distributed:

    [figure: histogram of the p-values across the 5000 runs]

    The observations on the left (p below 0.05) are runs that reject H0 at that confidence level, and the observations to the right are runs that cannot reject H0.

    So, as expected, at a confidence level of x % we will reject H0 in about (100 - x) % of the runs: when H0 is true, the p-value is uniformly distributed on [0, 1], hence the flat histogram. In your case, roughly 95% of the trials fail to reject H0 and roughly 5% reject it. A quick numerical check of this is sketched below.
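
    As a sketch (reusing the pValues list your script already builds, so that part is an assumption about the surrounding code), you can compare the empirical rejection rate against several significance levels, and even run a KS test of the p-values themselves against the uniform distribution:

        import numpy as np
        from scipy import stats

        pValues = np.asarray(pValues)  # p-values collected in the original loop

        # under H0 the p-value is uniform on [0, 1], so the rejection rate
        # should track the significance level itself
        for alpha in (0.01, 0.05, 0.10):
            print(alpha, np.mean(pValues < alpha))

        # a KS test of the p-values against the uniform distribution should
        # itself return a large p-value (no evidence against uniformity)
        print(stats.kstest(pValues, stats.uniform.cdf))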

    The relationship between the statistic and the p-value can also be visualized, which confirms that your independently computed critical value makes sense:

    [figure: p-value plotted against the D statistic]
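
    A minimal sketch of such a plot, assuming the statistics and pValues arrays from your script are still in scope:

        import numpy as np
        import matplotlib.pyplot as plt

        statistics = np.asarray(statistics)
        pValues = np.asarray(pValues)
        statCritical = 1.36 / np.sqrt(5000)

        plt.scatter(statistics, pValues, s=4)
        plt.axvline(statCritical, color='red', label='critical D (alpha = 0.05)')
        plt.axhline(0.05, color='gray', linestyle='--', label='p = 0.05')
        plt.xlabel('KS statistic D')
        plt.ylabel('p-value')
        plt.legend()
        plt.savefig('stat_vs_pvalue.png')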

    As you say, p-values are often difficult to interpret and their implementation may vary from one software package to another. In this case you can rely on the definition from Wikipedia:

    In null-hypothesis significance testing, the p-value is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct

    This is what is implemented, at least in scipy.stats.kstest (see the examples at the bottom for more details).
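
    As one more illustrative sketch (not from the original answer): the p-value that kstest returns for a single run can be reproduced as a survival probability of the KS-statistic distribution, i.e. the probability of a D at least as extreme as the one observed. Whether the exact stats.kstwo or the asymptotic stats.kstwobign matches to full precision depends on your scipy version and on the method kstest selects internally, so treat the comparison as approximate:

        import numpy as np
        from scipy import stats

        np.random.seed(12345)
        F = stats.norm.rvs(size=5000)
        res = stats.kstest(F, stats.norm.cdf, alternative='two-sided')

        n = len(F)
        # "probability of a result at least as extreme as the one observed"
        p_exact = stats.kstwo.sf(res.statistic, n)                 # exact, finite n
        p_asymp = stats.kstwobign.sf(res.statistic * np.sqrt(n))   # asymptotic

        print(res.pvalue, p_exact, p_asymp)  # all should be close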