I am trying to estimate the the density of a data set at certain points, using scipy.
from scipy.stats import gaussian_kde
import numpy as np
I have a dataset A
of 3D points (this is just a minimal example. My actual data has many more dimensions and many more samples)
A = np.array([[0.078377 , 0.76737392, 0.45038174],
[0.65990129, 0.13154658, 0.30770917],
[0.46068406, 0.22751313, 0.28122463]])
and the points at which I want to estimate the density
B = np.array([[0.40209377, 0.21063273, 0.75885516],
[0.91709997, 0.79303252, 0.65156937]])
But I can't seem to be able to use the gaussian_kde
function, as
result = gaussian_kde(A.T)(B.T)
returns
LinAlgError: Matrix is not positive definite
How do I fix this error? How do I get the density of my sample?
You have highly correlated features in your data which leads to a numerical error. There are several possible ways to address this, each with pros and cons. A drop-in replacement class for gaussian_kde
is proposed below.
Your dataset
(i.e. the matrix that you feed when creating the gaussian_kde
object, not when using it), likely contains highly dependent features. This fact (possibly combined with having low numerical resolution like float32
and "too many" datapoints) causes the covariance matrix to have near-zero eigenvalues, which due to numerical error can get below zero. This in turn will break the code that uses the Cholesky decomposition on the dataset covariance matrix (see explanation for specific details).
Assuming your dataset has shape (dims, N)
you can test if this is your problem via np.linalg.eigh(np.cov(dataset))[0] <= 0
. If any of the outputs comes out True
, let me be the first welcoming you to the club.
The idea is to get all eigenvalues above zero.
Increasing numerical resolution to the highest float that is practical can be an easy fix and worth trying, but may not be enough.
Given the fact that this is caused by correlated features, removing datapoints doesn't help much a priori. There is a slim hope that having less numbers to crush will propagate less error, and keep the eigenvalues above zero. It's easy to implement but it discards data points.
A more involved fix is to identify the highly correlated features and merge them or ignore the "superfluous" ones. This is tricky especially if the correlations among dimensions vary from instance to instance.
Probably the cleanest way is to leave the data untouched, and lift the problematic eigenvalues to positive values. This is usually done in 2 ways:
SVD addresses the problem directly at its core: Decompose the covariance matrix and replace the negative eigenvalues with a small positive epsilon
. This will put your matrix back to PD domain introducing minimal error.
If the SVD is too expensive to compute, an alternative numerical hack is to add epsilon * np.eye(D)
to the covariance matrix. This has the effect of adding epsilon
to each one of the eigenvalues. Again, if epsilon
is small enough, this doesn't introduce that much of an error.
If you go for the last approach you'll need to tell gaussian_kde
to modify its covariance matrix. This is a relatively clean way I found to do that: simply add this class to your codebase and replace gaussian_kde
with GaussianKde
(tested on my end, seems to work fine).
class GaussianKde(gaussian_kde):
"""
Drop-in replacement for gaussian_kde that adds the class attribute EPSILON
to the covmat eigenvalues, to prevent exceptions due to numerical error.
"""
EPSILON = 1e-10 # adjust this at will
def _compute_covariance(self):
"""Computes the covariance matrix for each Gaussian kernel using
covariance_factor().
"""
self.factor = self.covariance_factor()
# Cache covariance and inverse covariance of the data
if not hasattr(self, '_data_inv_cov'):
self._data_covariance = np.atleast_2d(np.cov(self.dataset, rowvar=1,
bias=False,
aweights=self.weights))
# we're going the easy way here
self._data_covariance += self.EPSILON * np.eye(
len(self._data_covariance))
self._data_inv_cov = np.linalg.inv(self._data_covariance)
self.covariance = self._data_covariance * self.factor**2
self.inv_cov = self._data_inv_cov / self.factor**2
L = np.linalg.cholesky(self.covariance * 2 * np.pi)
self._norm_factor = 2*np.log(np.diag(L)).sum() # needed for scipy 1.5.2
self.log_det = 2*np.log(np.diag(L)).sum() # changed var name on 1.6.2
In case your error is similar, but not quite that, or anyone feels curious, here is the process I followed, hopefully it helps.
The exception stack specified that the error was triggered during a Cholesky decomposition. In my case, this was this line inside the _compute_covariance
method.
Indeed, after the exception, checking the Eigenvalues for the covariance
and inv_cov
attributes via np.eigh
showed that covariance
had a close-to-zero negative eigenvalue, and hence its inverse had a huge one. Since Cholesky expects PD matrices, this triggered an error.
At this point we can heavily suspect that the tiny, negative eigenvalue is a numerical error, since covariance matrices are PSD. Indeed, the error source comes when the covariance matrix is originally computed from the dataset that has been passed to the constructor, here. In my case, the covariance matrix yielded at least 2 almost identical columns, which led to the residual negative eigenvalue due to numerical error.
When will your dataset lead to a quasi-singular covariance matrix? That question is perfectly addressed in this other SE post. The bottom line is: If some variable is an exact linear combination of the other variables, with constant term allowed, the correlation and covariance matrices of the variables will be singular. The dependency observed in such matrix between its columns is actually that same dependency as the dependency between the variables in the data observed after the variables have been centered (their means brought to 0) or standardized (if we mean correlation rather than covariance matrix)
(Kudos and +1 to ttnphns for the amazing work).