
Scipy's ks_2samp function gives the correct D_statistic but a wrong p_value


I am trying to perform a two-sample Kolmogorov-Smirnov test to check whether two samples come from the same population. Here is the code to reproduce my problem:

from scipy.stats import ks_2samp
import numpy as np

x = list(np.random.normal(10, 1, 3000))
y = list(np.random.normal(12, 1, 2000))
d_statistic, p_value = ks_2samp(x, y)

With scipy versions older than 1.3, I get the following results: d_statistic = 0.67317 and p_value = 0.0

However, with scipy versions >= 1.3, I get: d_statistic = 0.6705 and p_value = 0.9904774590824749

Both give almost the same d_statistic (the small difference comes from the random draws), but the more recent versions of scipy seem to give me a wrong p_value, and I do not understand why. x and y are clearly two samples that do not come from the same population.
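As a sanity check that does not depend on the mode used by ks_2samp, the statistic and a rough p_value can be recomputed by hand. The sketch below is only an illustration: the helper name ks_statistic is made up, the seed is arbitrary, and the p_value uses the classical Kolmogorov limiting distribution (scipy.stats.kstwobign) without any small-sample correction, so it is not necessarily the exact formula any particular scipy version applies.

```python
import numpy as np
from scipy.stats import kstwobign

# Seeded generator so the run is repeatable (the seed value is arbitrary).
rng = np.random.default_rng(0)
x = rng.normal(10, 1, 3000)
y = rng.normal(12, 1, 2000)


def ks_statistic(a, b):
    """Largest absolute gap between the two empirical CDFs (hypothetical helper)."""
    a = np.sort(np.asarray(a))
    b = np.sort(np.asarray(b))
    pooled = np.concatenate([a, b])
    # Empirical CDF of each sample, evaluated at every pooled observation.
    cdf_a = np.searchsorted(a, pooled, side="right") / a.size
    cdf_b = np.searchsorted(b, pooled, side="right") / b.size
    return np.max(np.abs(cdf_a - cdf_b))


d = ks_statistic(x, y)

# Square root of the effective sample size n*m / (n + m).
en = np.sqrt(x.size * y.size / (x.size + y.size))

# Asymptotic two-sided p-value from the Kolmogorov limiting distribution.
p_asymp = kstwobign.sf(d * en)

print(d, p_asymp)
```

With a D statistic this large and samples of this size, the asymptotic p_value should be vanishingly small, which is what one would expect for two normal samples whose means differ by two standard deviations.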

I did some research: since scipy 1.3, an 'exact' mode is available and is used by default for small samples (len(x), len(y) <= 10000, which is my case). However, if I change the mode from 'exact' to 'asymp', I get the same results as with the older scipy versions.

d_statistic, p_value = ks_2samp(x, y, mode='asymp')

Is there a problem with the 'exact' mode when computing the p_value or am I missing something?

Thanks for your help, h1t5uj1


Solution

  • For those who run into the same problem: this is a bug that appears when the sample sizes exceed a few thousand (credit to pvanmulbregt, who fixed the issue: https://github.com/scipy/scipy/issues/11184). It should be fixed in scipy 1.5.0. In the meantime, you can change the mode from 'exact' to 'asymp', or simply downgrade your scipy version.

    Hope this helps, H1t5uj1
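The workaround can be sketched as follows. This is a minimal example, not an official recipe: the wrapper name ks_2samp_asymp is made up, and it assumes that newer scipy releases call the keyword method while the releases discussed here call it mode, so it tries one and falls back to the other.

```python
import numpy as np
from scipy.stats import ks_2samp

# Seeded generator so the run is repeatable (the seed value is arbitrary).
rng = np.random.default_rng(0)
x = rng.normal(10, 1, 3000)
y = rng.normal(12, 1, 2000)


def ks_2samp_asymp(a, b):
    """Two-sample KS test forced to the asymptotic p-value (hypothetical wrapper)."""
    try:
        # Assumption: recent scipy releases name the keyword `method`.
        return ks_2samp(a, b, method="asymp")
    except TypeError:
        # Releases around 1.3 / 1.4 name the keyword `mode`.
        return ks_2samp(a, b, mode="asymp")


d_statistic, p_value = ks_2samp_asymp(x, y)
print(d_statistic, p_value)
```

On an affected scipy version this should give a p_value that is effectively zero again, in line with the pre-1.3 behaviour described in the question.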