python matplotlib histogram probability-density

Can't get y-axis on Matplotlib histogram to display probabilities


I have data (pd Series) that looks like (daily stock returns, n = 555):

S = perf_manual.returns
S = S[~((S-S.mean()).abs()>3*S.std())]

2014-03-31 20:00:00    0.000000
2014-04-01 20:00:00    0.000000
2014-04-03 20:00:00   -0.001950
2014-04-04 20:00:00   -0.000538
2014-04-07 20:00:00    0.000764
2014-04-08 20:00:00    0.000803
2014-04-09 20:00:00    0.001961
2014-04-10 20:00:00    0.040530
2014-04-11 20:00:00   -0.032319
2014-04-14 20:00:00   -0.008512
2014-04-15 20:00:00   -0.034109
...

I'd like to generate a probability distribution plot from this. Using:

print stats.normaltest(S)

n, bins, patches = plt.hist(S, 100, normed=1, facecolor='blue', alpha=0.75)
print np.sum(n * np.diff(bins))

(mu, sigma) = stats.norm.fit(S)
print mu, sigma
y = mlab.normpdf(bins, mu, sigma)
plt.grid(True)
l = plt.plot(bins, y, 'r', linewidth=2)

plt.xlim(-0.05,0.05)
plt.show()

I get the following:

NormaltestResult(statistic=66.587382579416982, pvalue=3.473230376732532e-15)
1.0
0.000495624926242 0.0118790391467

[image: histogram of the returns with the fitted normal curve overlaid]

I have the impression the y-axis is a count, but I'd like to have probabilities instead. How do I do that? I've tried a whole lot of StackOverflow answers and can't figure this out.


Solution

  • There is no easy way (that I know of) to do that using plt.hist. But you can simply bin the data using np.histogram and then normalize the counts any way you want. If I understood you correctly, you want each bar to show the probability of finding a point in that bin, NOT the probability density. That means you have to scale the counts so that the sum over all bins is 1. That can be done with bin_probability = n/float(n.sum()).

    The result is then no longer a properly normalized probability density function (pdf), meaning that the integral over an interval will not be a probability anymore. That is why you also have to rescale your mlab.normpdf curve so that it has the same normalization as your histogram. The factor needed is just the bin width: in the properly normalized binned pdf, the sum over all bins of height times width is 1, whereas now you want the sum of the bar heights alone to be 1. Multiplying each density value by the bin width converts one into the other.
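As a quick numerical check of the bin-width argument, here is a minimal sketch on synthetic data (the random sample and the seed are assumptions, not the asker's returns). It compares the density-normalized histogram with the per-bin probabilities and confirms that they differ by the bin width.

```python
import numpy as np

# Synthetic stand-in for the returns series (an assumption, not the real data)
rng = np.random.default_rng(0)
S = rng.normal(0, 0.01, size=1000)

# Density-normalized histogram: sum(density * bin_width) is 1
density, bin_edges = np.histogram(S, bins=100, density=True)
bin_widths = np.diff(bin_edges)

# Per-bin probabilities on the same edges: sum(probability) is 1
counts, _ = np.histogram(S, bins=bin_edges)
probability = counts / counts.sum()

# The two normalizations differ by exactly the bin width
print(np.allclose(density * bin_widths, probability))
print(probability.sum())
```

The same factor is what has to be applied to a fitted pdf curve before drawing it over a probability histogram.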

    Therefore, the code you end up with is something along the lines of:

    import numpy as np
    import scipy.stats as stats
    import matplotlib.pyplot as plt
    
    # Produce test data
    S = np.random.normal(0, 0.01, size=1000)
    
    # Histogram:
    # Bin it
    n, bin_edges = np.histogram(S, 100)
    # Normalize it, so that every bin's value gives the probability of that bin
    bin_probability = n/float(n.sum())
    # Get the midpoints of every bin
    bin_middles = (bin_edges[1:]+bin_edges[:-1])/2.
    # Compute the bin width (all bins have equal width here)
    bin_width = bin_edges[1]-bin_edges[0]
    # Plot the histogram as a bar plot
    plt.bar(bin_middles, bin_probability, width=bin_width)
    
    # Fit to normal distribution
    (mu, sigma) = stats.norm.fit(S)
    # The pdf should not be normalized anymore but scaled the same way as the data
    # (stats.norm.pdf replaces mlab.normpdf, which was removed in Matplotlib 3.1)
    y = stats.norm.pdf(bin_middles, mu, sigma)*bin_width
    l = plt.plot(bin_middles, y, 'r', linewidth=2)
    
    plt.grid(True)
    plt.xlim(-0.05,0.05)
    plt.show()
    

    And the resulting picture will be:

    [image: bar histogram of per-bin probabilities with the rescaled normal curve overlaid in red]
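One possible alternative that stays within plt.hist is to pass per-sample weights of 1/N through its `weights` argument, so that the bar heights sum to 1. The sketch below is only an illustration under that assumption: the helper names `probability_weights` and `plot_probability_hist` are hypothetical, and the data is synthetic.

```python
import numpy as np


def probability_weights(data):
    """Return one weight per sample (1/N) so weighted bin counts sum to 1."""
    data = np.asarray(data)
    return np.full(data.shape, 1.0 / data.size)


def plot_probability_hist(data, bins=100):
    """Draw a histogram whose bar heights are per-bin probabilities."""
    # Imported here so that the weight helper above needs only NumPy
    import matplotlib.pyplot as plt

    heights, edges, _ = plt.hist(data, bins=bins,
                                 weights=probability_weights(data),
                                 facecolor='blue', alpha=0.75)
    plt.ylabel('Probability per bin')
    plt.grid(True)
    return heights, edges


# Synthetic stand-in for the returns series (an assumption, not the real data)
rng = np.random.default_rng(0)
S = rng.normal(0, 0.01, size=1000)

# np.histogram accepts the same weights, which allows a check without plotting
heights, edges = np.histogram(S, bins=100, weights=probability_weights(S))
print(heights.sum())
```

Calling `plot_probability_hist(S)` and then `matplotlib.pyplot.show()` should display the same bars as the np.histogram approach. A fitted pdf drawn on top still needs to be multiplied by the bin width, for the reason given in the answer.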