I've run a one-sided KS test of my distribution (observations of occupancy on a public mass-transit grid, with values roughly in the 0-100 range) against a large number of theoretical probability distributions:
cdfs = [
    "norm",       # Normal (Gaussian)
    "alpha",      # Alpha
    "anglit",     # Anglit
    "arcsine",    # Arcsine
    "beta",       # Beta
    "betaprime",  # Beta Prime
    "bradford",   # Bradford
    "burr",       # Burr
    "cauchy",     # Cauchy
    # ...
]
import scipy.stats

for cdf in cdfs:
    # fit our data set against every probability distribution
    parameters = eval("scipy.stats." + cdf + ".fit(data_sample)")
    # apply the one-sided Kolmogorov-Smirnov test
    D, p = scipy.stats.kstest(data_sample, cdf, args=parameters)
    # pretty-print the results
    print(cdf.ljust(16) + "p: " + '{0:.10f}'.format(p).ljust(40) + "D: " + '{0:.10f}'.format(D))
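As an aside, the eval call can be replaced by getattr, which looks the distribution object up by name; a minimal equivalent sketch:

for cdf in cdfs:
    dist = getattr(scipy.stats, cdf)     # look the distribution up by name, no eval needed
    parameters = dist.fit(data_sample)   # same fit as above
    D, p = scipy.stats.kstest(data_sample, cdf, args=parameters)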
From what I understand about the one-sided KS test, the theoretical distribution that best fits my data is the one for which the test returns a large p-value and a low D statistic (see the quick sanity check below).
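To check that reading, here's a minimal sanity check on synthetic data (the normal parameters, sample size and seed are arbitrary): a sample tested against its own fitted distribution should give a small D and a large p.

import numpy as np
import scipy.stats

fake = scipy.stats.norm.rvs(loc=20, scale=5, size=1000, random_state=42)  # arbitrary values
params = scipy.stats.norm.fit(fake)
D, p = scipy.stats.kstest(fake, "norm", args=params)
print(D, p)  # D should be near 0 and p well above any usual significance level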
By that criterion, the best fits are:
cdf: invweibull p:0.1624542096 D:0.0352622822
cdf: genextreme p:0.1624292228 D:0.0352633673
cdf: nct p:0.1280588168 D:0.0369024688
cdf: invgamma p:0.1273446642 D:0.0369401507
cdf: johnsonsu p:0.0449026953 D:0.0433976894
cdf: invgauss p:0.0336248605 D:0.0450259762
(...)
cdf: frechet_l p:0.0000000000 D:0.8405035144
cdf: reciprocal p:0.0000000000 D:0.9380000000
cdf: truncnorm p:0.0000000000 D:0.9380000000
cdf: powernorm p:0.0000000000 D:1.0000000000
Furthermore, when I try to visually fit these supposedly best-fitting distributions to my data, something is not adding up:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import invgauss, invweibull, genextreme

fig, ax = plt.subplots(1, 1)

mu = 10.145462645553
x = np.linspace(invgauss.ppf(0.75, mu), invgauss.ppf(0.975, mu), 100)
ax.plot(x, invgauss.pdf(x, mu), '-', color='green', lw=1, alpha=0.6, label='invgauss pdf')

c = 0.8
y = np.linspace(invweibull.ppf(0.75, c), invweibull.ppf(0.975, c), 100)
ax.plot(y, invweibull.pdf(y, c), '-', color='red', lw=1, alpha=0.6, label='invweibull pdf')

c = -1.5
z = np.linspace(genextreme.ppf(0.75, c), genextreme.ppf(0.96, c), 100)
ax.plot(z, genextreme.pdf(z, c), '-', color='yellow', lw=1, alpha=0.6, label='genextreme pdf')

ax.hist(data_sample, density=True, histtype='stepfilled', bins=20, alpha=0.2, label='my distribution')
ax.legend(loc='best', frameon=False)
plt.show()
The resulting plot doesn't show any match between my data and the invgauss, invweibull or genextreme probability distributions.
Am I doing something wrong or assuming something wrong about the KS-test result?
A data sample from my distribution:
array([ 29.75, 0.8 , 9. , 4.77, 28.75, 31.1 , 52.12, 5. ,
10.55, 17.26, 19.28, 25.77, 53.13, 28. , 4.1 , 2.92,
40.4 , 15.33, 10.62, 20.6 , 26.11, 15. , 5.3 , 38.87,
1.28, 1.5 , 20.88, 16. , 10.33, 6.5 , 6. , 22.5 ,
7.88, 2.72, 60.33, 26.14, 18. , 18.58, 25. , 69.62,
0.5 , 0. , 26.87, 11.85, 13.16, 39.45, 17.6 , 14.66,
84.52, 3.62, 30.33, 4.25, 25. , 35. , 28.85, 48.37,
12.55, 50. , 22.94, 7.42, 2.37, 49.66, 22.94, 7.57,
101.12, 4.42, 43.88, 7. , 13. , 31.12, 20.71, 0. ,
22. , 21.34, 23.61, 0.5 , 16.23, 27.11, 2.22, 59. ,
24.41, 41.69, 2.68, 49. , 51.6 , 95.8 , 0. , 26.8 ,
66. , 43.02, 13.85, 46.91, 38.77, 6.5 , 24. , 54.14,
50.81, 21.55, 19.22, 12.83])
Solution
Please see the accepted answer for more details. Just for future reference: after estimating the correct parameters and passing them to the theoretical distributions that the one-sided KS test deemed most similar to my own, I was able to visually confirm the similarity.
You've made everything clear; just one thing is left: different distributions have different parameters. You should pass the estimated parameters into each distribution and only then perform the KS test and draw your final density plot.
Code:
scipy.stats.invgamma.fit(data_sample), scipy.stats.norm.fit(data_sample)
Output:
((4.399779777260058, -15.382411650381744, 137.60256212682822),
 (24.501099999999997, 21.016423572768037))
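For example, a minimal sketch that feeds the fitted parameters straight back into the KS test (this should simply reproduce the invgamma row from the table above):

import scipy.stats

params = scipy.stats.invgamma.fit(data_sample)
D, p = scipy.stats.kstest(data_sample, "invgamma", args=params)
print(D, p)  # should match the invgamma row reported earlier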
In other words, if you want to test your data against various distributions, you should fit the parameters of each distribution carefully.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import bradford, invgauss, invweibull, genextreme

fig, ax = plt.subplots(1, 1)
# set colors for the different distributions
colors = ['purple', 'green', 'red', 'yellow']
# the distributions you want to compare; bradford is added to see if there is any difference
dists = [bradford, invgauss, invweibull, genextreme]
for cdf in dists:
    # fit the parameters; their order matches .ppf(*parameters) thanks to the consistency of scipy.stats
    parameters = cdf.fit(data_sample)
    x = np.linspace(cdf.ppf(0.3, *parameters), cdf.ppf(0.99, *parameters), 100)
    ax.plot(x, cdf.pdf(x, *parameters), '-', color=colors[dists.index(cdf)], lw=1, alpha=0.6, label=cdf.name + ' pdf')
ax.hist(data_sample, density=True, histtype='stepfilled', bins=20, alpha=0.2, label='my distribution')
ax.legend(loc='best', frameon=False)
plt.show()
Since we use scipy.stats both for parameter fitting and for generating the ppf, the order of parameters within each distribution is preserved. If you use a different package or piece of software to do the same thing, the order of parameters is not guaranteed to be the same!
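A minimal sketch of what that consistency buys you: the tuple returned by .fit can be unpacked into pdf/ppf/cdf, or used once to freeze the distribution.

import scipy.stats

params = scipy.stats.invgamma.fit(data_sample)  # (shape a, loc, scale), in that order
rv = scipy.stats.invgamma(*params)              # frozen distribution with the fitted parameters
print(rv.ppf(0.5), rv.pdf(20.0))                # median and density at an arbitrary point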
For instance, in R the gamma distribution has only two parameters, shape and scale (or rate), which matches my statistics textbook. In Python, however, the gamma distribution has three parameters: shape, location, and scale. In the past I had to write my own gamma function in R to generate a gamma distribution with three parameters.
Code:
scipy.stats.gamma.fit(data_sample)
Output:
(0.9205943551269966, -6.9329968733926e-26, 27.77113522169767)
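If you want scipy to behave like the two-parameter textbook version, you can pin the location at zero while fitting; floc is a standard keyword of the .fit method:

import scipy.stats

shape, loc, scale = scipy.stats.gamma.fit(data_sample, floc=0)  # location fixed at 0
print(shape, loc, scale)  # loc is exactly 0; shape and scale are the usual two parameters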
As Lucas said in the comments, scipy.stats only provides .fit methods for continuous distributions. If you want to fit a discrete distribution, see statsmodels.discrete.discrete_model. For the Poisson distribution you can fit lambda with the MLE or the moment estimator; both give lambda = sample_mean. The estimation method is yours to choose, and you can write your own for a particular situation.