python numpy scipy statistics

Scipy Gumbel Fit does not Fit - What is the correct shape of the array / dataframe to use?


I'm trying to fit various distributions to my data and test (chi-squared?) which one fits best. I started with SciPy's gumbel_r distribution, as it is the one most often used in the literature.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy.stats as ss

data = pd.read_csv("data.csv")
data

sns.histplot(data["score"], kde=True, stat='probability')
plt.show()

x = np.linspace(0,1,101)
hist, bins = np.histogram(data["score"], bins=x, density=True)
loc, scale = ss.gumbel_r.fit(hist)
dist = ss.gumbel_r(loc=loc,scale=scale)
plt.plot(x, dist.pdf(x))
plt.show()

Inspecting the plots yields strange results. For example, the histogram of my data peaks at a score of ~0.09 with a bar height of ~0.025, but the plotted Gumbel PDF looks completely different.

My questions are now:

  1. Why don't the plots look similar? I suspect stat='probability' could be the culprit here.
  2. What do I need to do so that the second plot looks roughly like the first one?
  3. Ideally, I would compute another histogram of the fitted distribution over the same bins and pass both to scipy.stats.chisquare to quantify how well each distribution fits and see which fits best. Is that correct?

Solution

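  • Don't pass hist to gumbel_r.fit(). fit() expects the original observations and treats every element of its input as one data point, so fitting hist fits a Gumbel distribution to the bar heights rather than to the scores. Change the line that calls fit() to

    loc, scale = ss.gumbel_r.fit(data['score'].to_numpy())

    Also, to get the Seaborn plot on the same scale as the plot of the PDF, change stat='probability' to stat='density' in the histplot() call. With stat='probability' the bar heights sum to 1, so each bar is smaller than the density by a factor of the bin width; with stat='density' the bar areas sum to 1, which is the scale a PDF uses.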
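
To make the difference concrete, here is a minimal sketch that fits both ways and compares the results. The real data.csv isn't available, so it uses synthetic scores drawn from a Gumbel distribution with loc=0.09 and scale=0.04. Those values and all variable names are assumptions chosen only for illustration.

```python
import numpy as np
import scipy.stats as ss

# Synthetic stand-in for data["score"] (assumption: the real data are not available).
rng = np.random.default_rng(0)
scores = rng.gumbel(loc=0.09, scale=0.04, size=5000)

bins = np.linspace(0, 1, 101)
width = bins[1] - bins[0]
counts, _ = np.histogram(scores, bins=bins)
hist, _ = np.histogram(scores, bins=bins, density=True)

# Wrong: this treats the 100 bar heights as if they were 100 observations.
loc_bad, scale_bad = ss.gumbel_r.fit(hist)

# Right: fit to the raw observations.
loc, scale = ss.gumbel_r.fit(scores)

# Bar heights as probabilities: they sum to 1 and are smaller than the
# density heights by a factor of the bin width.
prob = counts / counts.sum()

print("fit on hist:     loc=%.3f scale=%.3f" % (loc_bad, scale_bad))
print("fit on raw data: loc=%.3f scale=%.3f" % (loc, scale))
print("prob / width matches density:", np.allclose(prob / width, hist))
```

The fit on the raw data should land close to the parameters used to generate the sample, while the fit on hist describes the distribution of bar heights and has nothing to do with the scores. Plotting ss.gumbel_r(loc, scale).pdf(x) over a histogram drawn with stat='density' should then give two curves on the same vertical scale.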