I'm trying to fit various distributions to my data and test (chi-squared?) which one fits best. I started with SciPy's gumbel_r distribution, as it is the one often used in the literature.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy.stats as ss
data = pd.read_csv("data.csv")
data
sns.histplot(data["score"], kde=True, stat='probability')
plt.show()
x = np.linspace(0,1,101)
hist, bins = np.histogram(data["score"], bins=x, density=True)
loc, scale = ss.gumbel_r.fit(hist)
dist = ss.gumbel_r(loc=loc,scale=scale)
plt.plot(x, dist.pdf(x))
plt.show()
Inspecting the plots yields strange results. For example, the histogram of my data peaks at a score of about 0.09 with a height of about 0.025. However, the plotted Gumbel PDF looks completely off.
My questions are now:
1. Could stat='probability' be the culprit here?
2. My idea is to compute hist for the same bins from the fitted distribution and pass both into scipy.stats.chisquare to quantify how good the fit is and see which distribution fits best. Is that correct?

Don't give hist to gumbel_r.fit(). It expects the original data. Change the line that calls fit() to:
loc, scale = ss.gumbel_r.fit(data['score'].to_numpy())
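To see why this matters, here is a minimal sketch that uses synthetic Gumbel samples as a stand-in for data["score"] (the real data.csv is not available here, so the sample size and the true loc and scale values are assumptions chosen for illustration). It fits once on the raw values and once on the histogram heights.

```python
import numpy as np
import scipy.stats as ss

# Synthetic stand-in for data["score"]; the true parameters are assumed values.
rng = np.random.default_rng(0)
scores = ss.gumbel_r.rvs(loc=0.1, scale=0.05, size=5000, random_state=rng)

# Fit on the raw observations: recovers values close to loc=0.1, scale=0.05.
loc, scale = ss.gumbel_r.fit(scores)

# Fit on histogram heights: this models the distribution of the bar heights,
# not the distribution of the scores, so the parameters are unrelated.
edges = np.linspace(scores.min(), scores.max(), 101)
heights, _ = np.histogram(scores, bins=edges, density=True)
bad_loc, bad_scale = ss.gumbel_r.fit(heights)

print(f"fit on raw data:         loc={loc:.3f}, scale={scale:.3f}")
print(f"fit on histogram heights: loc={bad_loc:.3f}, scale={bad_scale:.3f}")
```

The second pair of numbers describes how tall the bars are, which is why the PDF plotted from it does not resemble the data.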
Also, to get the Seaborn plot on the same scale as the plot of the PDF, change stat='probability' to stat='density' in the histplot() call.
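As for the chi-squared idea in the question, one possible approach is sketched below. It is not a definitive recipe: chi2_gof is a hypothetical helper, the data is synthetic, and the choice of 20 equal-probability bins is an assumption. Note that scipy.stats.chisquare compares counts (not densities) and requires the observed and expected totals to match, so the expected counts are rescaled.

```python
import numpy as np
import scipy.stats as ss

# Synthetic stand-in for data["score"]; parameters are assumed values.
rng = np.random.default_rng(0)
scores = ss.gumbel_r.rvs(loc=0.1, scale=0.05, size=5000, random_state=rng)

def chi2_gof(sample, dist, n_bins=20):
    """Hypothetical helper: chi-squared goodness of fit for a scipy.stats family."""
    params = dist.fit(sample)  # fit on the raw data, not on a histogram
    # Bin edges at sample quantiles, so every bin holds roughly equal counts.
    edges = np.quantile(sample, np.linspace(0, 1, n_bins + 1))
    observed, _ = np.histogram(sample, bins=edges)
    # Expected bin probabilities from CDF differences of the fitted distribution.
    expected = np.diff(dist.cdf(edges, *params))
    # Rescale so both arrays have the same total, as chisquare requires.
    expected = expected / expected.sum() * observed.sum()
    # ddof accounts for the parameters estimated from the data.
    return ss.chisquare(observed, expected, ddof=len(params))

results = {d.name: chi2_gof(scores, d) for d in (ss.gumbel_r, ss.norm)}
for name, res in results.items():
    print(f"{name}: chi2={res.statistic:.1f}, p={res.pvalue:.3g}")
```

A smaller statistic indicates a closer fit for the same binning. The p-values are only approximate here, because the parameters were estimated from the unbinned data rather than from the bin counts.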