python scipy numerical-integration kernel-density quad

Integrating a KDE: strange behavior of scipy.integrate.quad and the bandwidth setting


I was looking for a way to obtain the mean (expected value) of a distribution that I sampled from and then fitted with a kernel density estimate using scipy.stats.gaussian_kde. I remember from my statistics class that the expected value is just the integral of x * pdf(x) from -infinity to infinity:

$$E[X] = \int_{-\infty}^{\infty} x \cdot \mathrm{pdf}(x) \, dx$$

I used the scipy.integrate.quad function for this in my code, but I ran into apparently strange behavior (which might have something to do with the bandwidth parameter of the KDE).

Problem

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
from scipy.integrate import quad

np.random.seed(42)

# Generating sample data
test_array = np.concatenate([np.random.normal(loc=-10, scale=.8, size=100),\
np.random.normal(loc=4,scale=2.0,size=500)])


kde = gaussian_kde(test_array,bw_method=0.5)


X_range = np.arange(-16,20,0.1)

# Scalar wrapper around the KDE, so that quad gets a plain float back
pdf = lambda x: kde.evaluate([x])[0]

# Evaluate the KDE on the whole grid in one vectorized call
y = kde.evaluate(X_range)

_ = plt.plot(X_range,y)


# Integrate over pdf * x to obtain the mean
mean_integration_low_bw = quad(lambda x: x * pdf(x), a=-np.inf, b=np.inf)[0]

# Calculate the cdf at point of the mean
zero_int_low = quad(lambda x: pdf(x), a=-np.inf, b=mean_integration_low_bw)[0]

print("The mean after integration: {}\n".format(round(mean_integration_low_bw,4)))

print("F({}): {}".format(round(mean_integration_low_bw,4),round(zero_int_low,4)))

plt.axvline(x=mean_integration_low_bw,color ="r")
plt.show()

If I execute this code, the integrated mean and the cumulative distribution function evaluated at that mean behave strangely:

[Image: KDE plot for bw_method=0.5 with the integrated mean marked by a red vertical line]

First question: In my opinion it should always give F(mean) = 0.5, or am I wrong here? (Does this only apply to symmetric distributions?)

Second question: The stranger thing is that the integrated mean does not change with the bandwidth parameter. In my opinion the mean should change too if the shape of the estimated distribution changes. If I set the bandwidth to 5, I get the following graph:

[Image: KDE plot with the bandwidth set to 5]

Why is the mean still the same if the curve now has a different shape (due to the wider bandwidth)?

I hope these questions do not arise only from my flawed understanding of statistics ;)


Solution

  • Your initial data is generated here:

    # Generating sample data
    test_array = np.concatenate([np.random.normal(loc=-10, scale=.8, size=100),\
                                 np.random.normal(loc=4,scale=2.0,size=500)])
    
    

    So you have 500 samples from a distribution with mean 4 and 100 samples from a distribution with mean -10, and you can predict the expected average as the weighted mean: (500*4 - 100*10)/(500+100) = 1.6666... That is pretty close to the result given by your code, and also consistent with the first plot.
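
The weighted-mean argument can be checked numerically. The following is a minimal sketch, assuming the same seed and sample as in the question; the helper `kde_mean` and the finite integration range of 8 kernel standard deviations are illustrative choices, not something taken from the question. Each Gaussian kernel is symmetric around its data point, so the mean of the KDE should equal the sample mean whatever the bandwidth. F(mean) = 0.5 would require the mean to coincide with the median, which is not expected for a skewed sample like this one.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import gaussian_kde

np.random.seed(42)

# Same sample as in the question: 100 points around -10, 500 points around 4
test_array = np.concatenate([np.random.normal(loc=-10, scale=0.8, size=100),
                             np.random.normal(loc=4, scale=2.0, size=500)])

sample_mean = test_array.mean()


def kde_mean(kde, data, n_sd=8.0):
    """Integrate x * pdf(x) over a finite range covering the kernel mass.

    Hypothetical helper: the range is the data range widened by n_sd
    kernel standard deviations on each side.
    """
    kernel_sd = np.sqrt(kde.covariance[0, 0])
    lo = data.min() - n_sd * kernel_sd
    hi = data.max() + n_sd * kernel_sd
    value, _ = quad(lambda x: x * kde.evaluate([x])[0], lo, hi, limit=200)
    return value


# The KDE mean for a narrow and a wide bandwidth
kde_means = {}
for bw in (0.5, 5):
    kde = gaussian_kde(test_array, bw_method=bw)
    kde_means[bw] = kde_mean(kde, test_array)
    print("bw_method={}: KDE mean {:.4f}, sample mean {:.4f}".format(
        bw, kde_means[bw], sample_mean))

# CDF of the narrow-bandwidth KDE evaluated at the mean
kde_low = gaussian_kde(test_array, bw_method=0.5)
cdf_at_mean = kde_low.integrate_box_1d(-np.inf, sample_mean)
print("F(mean) = {:.4f}".format(cdf_at_mean))
```

A wider bandwidth spreads every kernel out but leaves each one centered on its data point, so only the shape of the curve changes, not its mean. `integrate_box_1d` is used for the CDF here because it evaluates the Gaussian integrals in closed form instead of going through `quad`.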