pythonscipypower-law

python scipy stats pareto fit: how does it work


... help and online documentation say the function scipy.stats.pareto.fit takes as variables the dataset to be fitted, and optionally b (the exponent), loc, scale. the result comes as triplet (exponent, loc, scale)

generating data from the same distribution should result in the fit finding the parameters used for generating the data, e.g. (using the python 3 colsole)

$  python
Python 3.3.0 (default, Dec 12 2012, 07:43:02) 
[GCC 4.7.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

(in code lines below leaving out the python console prompt ">>>")

dataset=scipy.stats.pareto.rvs(1.5,size=10000)  #generating data
scipy.stats.pareto.fit(dataset)

however this results in

(1.0, nan, 0.0)

(exponent 1, should be 1.5) and

dataset=scipy.stats.pareto.rvs(1.1,size=10000)  #generating data
scipy.stats.pareto.fit(dataset)

results in

(1.0, nan, 0.0)

(exponent 1, should be 1.1) and

dataset=scipy.stats.pareto.rvs(4,loc=2.0,scale=0.4,size=10000)    #generating data
scipy.stats.pareto.fit(dataset)

(exponent should be 4, loc should be 2, scale should be 0.4) in

(1.0, nan, 0.0)

etc. giving another exponent when calling the fit function

scipy.stats.pareto.fit(dataset,1.4)

returns always exactly this exponent

(1.3999999999999999, nan, 0.0)

The obvious question would be: do I misunderstand the purpose of this fit function completely, is it used somehow differently, or is it simply broken?

a remark: before someone mentions that dedicated functions like those given on Aaron Clauset's web pages (http://tuvalu.santafe.edu/~aaronc/powerlaws/) are more reliable than the scipy.stats methods and should be used instead: that may be true, but they are also very very very very time consuming and do for datasets of 10000 points take many many hours (maybe days, weeks, years) on a normal PC.

edit: oh: the parameter of the fit function is not the exponent of the distribution but exponent minus 1 (but this does not change the above issue)


Solution

  • The fit method is a very general and simple method that does optimize.fmin on the non-negative likelihood function (self.nnlf) for the distribution. In distributions like pareto which have parameters that can create undefined regions, the general method doesn't work.

    In particular, the general nnlf method returns "inf" when the value of the random-variable doesn't fit into domain of validity of the distribution. The "fmin" optimizer doesn't play well with this objective function unless you have guessed the starting value very closely to the ultimate fit.

    In general, the .fit method needs to use a constrained optimizer for distributions where there are limits on the domain of applicability of the pdf.