I created some simple seaborn KDE plots and wonder whether what I see is a bug.
My code is:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
sns.kdeplot(np.array([1,2]), cmap="Reds", shade=True, bw=0.01)
sns.kdeplot(np.array([2.4,2.5]), cmap="Blues", shade=True, bw=0.01)
plt.show()
The blue and red curves each show the KDE of 2 points. When the points are close together, the densities are much narrower than when the points are further apart. I find this very counterintuitive, at least to the extent seen here, and I am wondering whether it might be a bug. I also could not find a resource describing how the densities are computed from a given set of points. Any help is appreciated.
The bw_method= parameter (called bw= in older versions) is passed directly to scipy.stats.gaussian_kde. The docs there say: "If a scalar, this will be used directly as kde.factor". The explanation of kde.factor says: "The square of kde.factor multiplies the covariance matrix of the data in the kde estimation." So it is a scaling factor relative to the spread of the data, not a bandwidth in data units, which is why points that lie closer together get narrower kernels. If more details are needed, you could dive into scipy's source code, or into the research papers referenced in the docs.
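To make the scaling concrete, here is a small sketch that reads the resulting kernel width back from scipy for the two datasets in the question. It assumes that gaussian_kde exposes the scaled covariance as kde.covariance and that the underlying data covariance is the sample covariance (ddof=1); the variable names are only illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

factor = 0.01  # the scalar passed as bw= / bw_method=

widths = []
for data in (np.array([1.0, 2.0]), np.array([2.4, 2.5])):
    kde = gaussian_kde(data, bw_method=factor)
    # Standard deviation of each Gaussian kernel, in data units
    kernel_std = np.sqrt(kde.covariance[0, 0])
    # Assumption: scipy scales the sample covariance (ddof=1) by factor**2
    expected = factor * np.std(data, ddof=1)
    widths.append((kernel_std, expected))
    print(data, kernel_std, expected)
```

With the same factor, the kernels for the points 0.1 apart come out ten times narrower than for the points 1 apart, which matches what the plot shows.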
If you really want to counter the scaling, you could divide it away: sns.kdeplot(np.array(data), ..., bw_method=0.01/np.std(data, ddof=1)). Here ddof=1 matches the sample covariance that scipy uses internally.
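As a quick check of that idea, the following sketch computes the bw_method value for a desired kernel width and verifies it directly with scipy. The helper name bw_for_width is hypothetical, and the use of ddof=1 assumes scipy's sample covariance.

```python
import numpy as np
from scipy.stats import gaussian_kde

def bw_for_width(data, h):
    """Hypothetical helper: bw_method scalar giving a kernel std of h in data units."""
    return h / np.std(data, ddof=1)

h = 0.01  # desired kernel standard deviation in data units
data_a = np.array([1.0, 2.0])
data_b = np.array([2.4, 2.5])

kde_a = gaussian_kde(data_a, bw_method=bw_for_width(data_a, h))
kde_b = gaussian_kde(data_b, bw_method=bw_for_width(data_b, h))

# Both kernels should now have the same width, regardless of the data spread
std_a = np.sqrt(kde_a.covariance[0, 0])
std_b = np.sqrt(kde_b.covariance[0, 0])
print(std_a, std_b)
```

The same scalar can be passed to sns.kdeplot via bw_method= (or bw= in older seaborn versions).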
Or you could create your own version of a Gaussian KDE, with a bandwidth in data coordinates. It just sums one Gaussian curve per data point and normalizes by dividing by the number of curves, so that the total area under the curve is 1.
Here is some example code, with kde curves for 1, 2 or 20 input points:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
def gauss(x, mu=0.0, sigma=1.0):
    return np.exp(-((x - mu) / sigma) ** 2 / 2) / (sigma * np.sqrt(2 * np.pi))

def kde(xs, data, sigma=1.0):
    return gauss(xs.reshape(-1, 1), data.reshape(1, -1), sigma).sum(axis=1) / len(data)

sns.set()
sigma = 0.03
xs = np.linspace(0, 4, 300)
fig, ax = plt.subplots(figsize=(12, 5))
data1 = np.array([1, 2])
kde1 = kde(xs, data1, sigma=sigma)
ax.plot(xs, kde1, color='crimson', label=f'dist of 1, σ={sigma}')
ax.fill_between(xs, kde1, color='crimson', alpha=0.3)
data2 = np.array([2.4, 2.5])
kde2 = kde(xs, data2, sigma=sigma)
ax.plot(xs, kde2, color='dodgerblue', label=f'dist of 0.1, σ={sigma}')
ax.fill_between(xs, kde2, color='dodgerblue', alpha=0.3)
data3 = np.array([3])
kde3 = kde(xs, data3, sigma=sigma)
ax.plot(xs, kde3, color='limegreen', label=f'1 point, σ={sigma}')
ax.fill_between(xs, kde3, color='limegreen', alpha=0.3)
data4 = np.random.normal(0.01, 0.1, 20).cumsum() + 1.1
kde4 = kde(xs, data4, sigma=sigma)
ax.plot(xs, kde4, color='purple', label=f'20 points, σ={sigma}')
ax.fill_between(xs, kde4, color='purple', alpha=0.3)
ax.margins(x=0) # remove superfluous whitespace left and right
ax.set_ylim(ymin=0) # let the plot "sit" onto y=0
ax.legend()
plt.show()
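To confirm that this hand-rolled version is normalized, one can integrate the curves numerically. The sketch below repeats the two functions so it runs on its own and uses a simple Riemann sum on a finer grid; the grid size and the dictionary names are just illustrative choices.

```python
import numpy as np

def gauss(x, mu=0.0, sigma=1.0):
    return np.exp(-((x - mu) / sigma) ** 2 / 2) / (sigma * np.sqrt(2 * np.pi))

def kde(xs, data, sigma=1.0):
    # One Gaussian per data point, averaged over the number of points
    return gauss(xs.reshape(-1, 1), data.reshape(1, -1), sigma).sum(axis=1) / len(data)

sigma = 0.03
xs = np.linspace(0, 4, 4001)  # finer grid than in the plot, for the integration
dx = xs[1] - xs[0]

datasets = {
    "dist of 1": np.array([1.0, 2.0]),
    "dist of 0.1": np.array([2.4, 2.5]),
    "1 point": np.array([3.0]),
}
# Riemann-sum approximation of the area under each curve
areas = {name: kde(xs, data, sigma).sum() * dx for name, data in datasets.items()}
# Peak height of the single-point curve
peak_single = kde(xs, datasets["1 point"], sigma).max()
print(areas, peak_single)
```

Each area should come out very close to 1, and the single-point curve peaks at 1 / (sigma * sqrt(2 * pi)), independent of how far apart the other datasets' points are.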