I tried to compare the cdf values obtained from a cumsum of the pdf with the cdf from scipy.stats.norm. Why are these different?
#%%
import numpy as np
from scipy.stats import norm
x=np.arange(10)
m=np.mean(x) # mean of x
v=np.var(x,ddof=1) # variance of x
s=np.std(x,ddof=1) # standard deviation of x
x1=np.linspace(min(x),max(x),10)
y=norm.pdf(x1, loc=m, scale=s) # pdf evaluated on the grid
y=np.cumsum(y) # running sum of the pdf values
y=y/y[-1] # normalise so the last value is 1
print(f'y is : {y}')
y1=norm.cdf(x1, loc=m, scale=s) # exact cdf on the same grid
y1=y1/y1[-1] # same normalisation for comparison
print(f'y1 is : {y1}')
y is : [0.04835861 0.12317281 0.22695357 0.35603744 0.5 0.64396256
0.77304643 0.87682719 0.95164139 1. ]
y1 is : [0.07365228 0.13295909 0.21954114 0.33299004 0.46641076 0.60724152
0.74066224 0.85411114 0.94069318 1. ]
The cumsum of the pdf is only an approximation of the cdf. Recall that the cdf, F(x), is the integral of f(x) (the pdf) from negative infinity to x. The cdf in scipy.stats.norm is an analytic function, so it will be (nearly) exact.
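For illustration (a small check that is not part of the question's code; the evaluation point x_eval = 6.0 is arbitrary): integrating the pdf numerically with scipy.integrate.quad from negative infinity up to a point reproduces norm.cdf at that point.

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

m = 4.5
s = np.std(np.arange(10), ddof=1)  # same mean/std as in the question
x_eval = 6.0                       # arbitrary evaluation point

# Integrate the pdf from the far left tail up to x_eval.
integral, _ = quad(lambda t: norm.pdf(t, loc=m, scale=s), -np.inf, x_eval)

print(integral)                          # numerical integral of the pdf
print(norm.cdf(x_eval, loc=m, scale=s))  # scipy's analytic cdf; the two agree closely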
The cumulative sum in your implementation has two issues. 1) You aren't integrating from negative infinity; you start at the lower bound of x1, which is only about 1.5 sigma below the mean, so the probability mass in the left tail is missing. 2) cumsum effectively assumes the pdf is constant across each dx step of your x1 array, which is not true either. Using a larger number of points in x1 shrinks the discretization error from point 2), but the tail error from point 1) remains; for close agreement you also need to start x1 several sigma below the mean. A quick check of both points is sketched below.
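A quick numerical check of both points (the grid choices here are illustrative, not from the original code): scaling the cumulative sum by the grid spacing turns it into a rectangle-rule estimate of the integral, which makes the two error sources easy to separate.

import numpy as np
from scipy.stats import norm

m = 4.5
s = np.std(np.arange(10), ddof=1)

def max_error(lo, hi, n):
    """Max difference between a rectangle-rule integral of the pdf and the exact cdf."""
    x1 = np.linspace(lo, hi, n)
    dx = x1[1] - x1[0]
    approx = np.cumsum(norm.pdf(x1, loc=m, scale=s)) * dx  # cumsum scaled by the spacing
    return np.max(np.abs(approx - norm.cdf(x1, loc=m, scale=s)))

# Point 1: refining the grid over the original range [0, 9] does not help much,
# because the mass below x = 0 (about norm.cdf(0, m, s) ~ 0.07) is still missing.
print(max_error(0, 9, 10_000))

# Point 2: start several sigma below the mean and what remains is pure
# discretization error, which shrinks as the grid is refined.
print(max_error(m - 8 * s, 9, 100))
print(max_error(m - 8 * s, 9, 10_000))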