I've run a one-sided KS test of my distribution (observations of occupancy on a public mass-transit grid, with values roughly in the 0-100 range) against a large number of theoretical probability distributions:
cdfs = [
    "norm",       # Normal (Gaussian)
    "alpha",      # Alpha
    "anglit",     # Anglit
    "arcsine",    # Arcsine
    "beta",       # Beta
    "betaprime",  # Beta Prime
    "bradford",   # Bradford
    "burr",       # Burr
    "cauchy",     # Cauchy
    # ...
]
import scipy.stats

for cdf in cdfs:
    # fit our data set against every probability distribution
    parameters = eval("scipy.stats." + cdf + ".fit(data_sample)")
    # apply the one-sided Kolmogorov-Smirnov test
    D, p = scipy.stats.kstest(data_sample, cdf, args=parameters)
    # pretty-print the results
    print(cdf.ljust(16) + "p: " + '{0:.10f}'.format(p).ljust(40) + "D: " + '{0:.10f}'.format(D))
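As an aside, the eval call can be replaced by getattr, which looks the distribution object up by name; a minimal equivalent sketch:

for cdf in cdfs:
    dist = getattr(scipy.stats, cdf)     # look the distribution up by name, no eval needed
    parameters = dist.fit(data_sample)   # same fit as above
    D, p = scipy.stats.kstest(data_sample, cdf, args=parameters)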
From what I understand about the one-sided KS test, the theoretical distribution that best fits my data is the one for which the test returns a large p-value and a low D statistic (see the quick sanity check below).
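To check that reading, here's a minimal sanity check on synthetic data (the normal parameters, sample size and seed are arbitrary): a sample tested against its own fitted distribution should give a small D and a large p.

import numpy as np
import scipy.stats

fake = scipy.stats.norm.rvs(loc=20, scale=5, size=1000, random_state=42)  # arbitrary values
params = scipy.stats.norm.fit(fake)
D, p = scipy.stats.kstest(fake, "norm", args=params)
print(D, p)  # D should be near 0 and p well above any usual significance level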
By that criterion, the best fits are:
cdf: invweibull p:0.1624542096 D:0.0352622822
cdf: genextreme p:0.1624292228 D:0.0352633673
cdf: nct p:0.1280588168 D:0.0369024688
cdf: invgamma p:0.1273446642 D:0.0369401507
cdf: johnsonsu p:0.0449026953 D:0.0433976894
cdf: invgauss p:0.0336248605 D:0.0450259762
(...)
cdf: frechet_l p:0.0000000000 D:0.8405035144
cdf: reciprocal p:0.0000000000 D:0.9380000000
cdf: truncnorm p:0.0000000000 D:0.9380000000
cdf: powernorm p:0.0000000000 D:1.0000000000
Furthermore, when I try to visually fit these supposedly best-fitting distributions to my data, something is not adding up:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import invgauss, invweibull, genextreme

fig, ax = plt.subplots(1, 1)

mu = 10.145462645553
x = np.linspace(invgauss.ppf(0.75, mu), invgauss.ppf(0.975, mu), 100)
ax.plot(x, invgauss.pdf(x, mu), '-', color='green', lw=1, alpha=0.6, label='invgauss pdf')

c = 0.8
y = np.linspace(invweibull.ppf(0.75, c), invweibull.ppf(0.975, c), 100)
ax.plot(y, invweibull.pdf(y, c), '-', color='red', lw=1, alpha=0.6, label='invweibull pdf')

c = -1.5
z = np.linspace(genextreme.ppf(0.75, c), genextreme.ppf(0.96, c), 100)
ax.plot(z, genextreme.pdf(z, c), '-', color='yellow', lw=1, alpha=0.6, label='genextreme pdf')

ax.hist(data_sample, density=True, histtype='stepfilled', bins=20, alpha=0.2, label='my distribution')
ax.legend(loc='best', frameon=False)
plt.show()
The resulting plot doesn't show any match between my data and the invgauss, invweibull or genextreme probability distributions.
Am I doing something wrong or assuming something wrong about the KS-test result?
A data sample from my distribution:
array([ 29.75, 0.8 , 9. , 4.77, 28.75, 31.1 , 52.12, 5. ,
10.55, 17.26, 19.28, 25.77, 53.13, 28. , 4.1 , 2.92,
40.4 , 15.33, 10.62, 20.6 , 26.11, 15. , 5.3 , 38.87,
1.28, 1.5 , 20.88, 16. , 10.33, 6.5 , 6. , 22.5 ,
7.88, 2.72, 60.33, 26.14, 18. , 18.58, 25. , 69.62,
0.5 , 0. , 26.87, 11.85, 13.16, 39.45, 17.6 , 14.66,
84.52, 3.62, 30.33, 4.25, 25. , 35. , 28.85, 48.37,
12.55, 50. , 22.94, 7.42, 2.37, 49.66, 22.94, 7.57,
101.12, 4.42, 43.88, 7. , 13. , 31.12, 20.71, 0. ,
22. , 21.34, 23.61, 0.5 , 16.23, 27.11, 2.22, 59. ,
24.41, 41.69, 2.68, 49. , 51.6 , 95.8 , 0. , 26.8 ,
66. , 43.02, 13.85, 46.91, 38.77, 6.5 , 24. , 54.14,
50.81, 21.55, 19.22, 12.83])
Solution
Please see the accepted answer for more details. Just for future reference: after estimating the correct parameters and passing them to the theoretical distributions that the one-sided KS test deemed most similar to my own, I was able to visually confirm the similarity.
You've made everything clear; just one thing is left: different distributions have different parameters. You should pass the estimated parameters into each distribution and only then perform the KS test and draw your final density plot.
Code:
scipy.stats.invgamma.fit(data_sample), scipy.stats.norm.fit(data_sample)
Output:
((4.399779777260058, -15.382411650381744, 137.60256212682822),
 (24.501099999999997, 21.016423572768037))
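For example, a minimal sketch that feeds the fitted parameters straight back into the KS test (this should simply reproduce the invgamma row from the table above):

import scipy.stats

params = scipy.stats.invgamma.fit(data_sample)
D, p = scipy.stats.kstest(data_sample, "invgamma", args=params)
print(D, p)  # should match the invgamma row reported earlier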
In other words, if you want to test your data against various distributions, you should fit the parameters of each distribution carefully.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import bradford, invgauss, invweibull, genextreme

fig, ax = plt.subplots(1, 1)
# set colors for the different distributions
colors = ['purple', 'green', 'red', 'yellow']
# the distributions you want to compare; bradford is added to see if there is any difference
dists = [bradford, invgauss, invweibull, genextreme]
for cdf in dists:
    # fit the parameters; their order matches .ppf(*parameters) thanks to the consistency of scipy.stats
    parameters = cdf.fit(data_sample)
    x = np.linspace(cdf.ppf(0.3, *parameters), cdf.ppf(0.99, *parameters), 100)
    ax.plot(x, cdf.pdf(x, *parameters), '-', color=colors[dists.index(cdf)], lw=1, alpha=0.6, label=cdf.name + ' pdf')
ax.hist(data_sample, density=True, histtype='stepfilled', bins=20, alpha=0.2, label='my distribution')
ax.legend(loc='best', frameon=False)
plt.show()
Since we use scipy.stats both for parameter fitting and for generating the ppf, the order of parameters within each distribution is preserved. If you use a different package or piece of software to do the same thing, the order of parameters is not guaranteed to be the same!
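A minimal sketch of what that consistency buys you: the tuple returned by .fit can be unpacked into pdf/ppf/cdf, or used once to freeze the distribution.

import scipy.stats

params = scipy.stats.invgamma.fit(data_sample)  # (shape a, loc, scale), in that order
rv = scipy.stats.invgamma(*params)              # frozen distribution with the fitted parameters
print(rv.ppf(0.5), rv.pdf(20.0))                # median and density at an arbitrary point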
For instance, in R the gamma distribution has only two parameters, shape and scale (or rate), which matches my statistics textbook. In Python, however, the gamma distribution has three parameters: shape, location, and scale. In the past I had to write my own gamma function in R to generate a gamma distribution with three parameters.
Code:
scipy.stats.gamma.fit(data_sample)
Output:
(0.9205943551269966, -6.9329968733926e-26, 27.77113522169767)
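If you want scipy to behave like the two-parameter textbook version, you can pin the location at zero while fitting; floc is a standard keyword of the .fit method:

import scipy.stats

shape, loc, scale = scipy.stats.gamma.fit(data_sample, floc=0)  # location fixed at 0
print(shape, loc, scale)  # loc is exactly 0; shape and scale are the usual two parameters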
As Lucas said in the comments, scipy.stats only provides .fit methods for continuous distributions. If you want to fit a discrete distribution, see statsmodels.discrete.discrete_model. For the Poisson distribution you can fit lambda with the MLE or the moment estimator; both give lambda = sample_mean. The estimation method is yours to choose, and you can write your own for a particular situation.