pythonscipycumsumcdf

why the cumsum of pdf and cdf are different for scipy.stats.norm?


I tried to compare the results for the values of cdf from cumsum of pdf and cdf of scipy.stats.norm. Why these are different?

#%%
import numpy as np
from scipy.stats import norm


x=np.arange(10)

m=np.mean(x) # mean of x
v=np.var(x,ddof=1) # variance of x
s=np.std(x,ddof=1)  # standard deviation of x

x1=np.linspace(min(x),max(x),10)
y=norm.pdf(x1, loc=m, scale=s)
y=np.cumsum(y)
y=y/y[-1]
print(f'y is : {y}')

y1=norm.cdf(x1, loc=m, scale=s)
y1=y1/y1[-1]
print(f'y1 is : {y1}')

y is : [0.04835861 0.12317281 0.22695357 0.35603744 0.5        0.64396256
 0.77304643 0.87682719 0.95164139 1.        ]
y1 is : [0.07365228 0.13295909 0.21954114 0.33299004 0.46641076 0.60724152
 0.74066224 0.85411114 0.94069318 1.        ]


Solution

  • The cumsum of the pdf is only an approximation of the cdf.

    Recall that the cdf, F(x), is the integral of f(x) (the pdf) from negative infinity to x. The cdf for scipy.norm is an analytic function so will be (nearly) exact.

    The cumulative sum in your implementation has two issues. 1) You aren't integrating from negative infinity, but instead integrating starting from your lower bound in x1 which is only about 1.5sigma away from the mean, so there is some error introduced there. And 2) cumsum effectively assumes that the function value is constant across the dx value in your x1 array, which is not true either. If you choose a larger number of points in your x1 array you will get closer agreement between the two functions (though the error in point 1) will still be there.