java data-mining mahout data-cleaning mahalanobis

The Mahalanobis distance between a point and the mean vector is always the same


I've recently been working on a data-cleaning algorithm. When I calculate the Mahalanobis distance between each point in the data set and the mean vector, the result is always the same.

For example, I have a data set like:

{{2,2,3},{4,5,9},{7,8,9}}

The mean vector is:

{13/3,5,7}

And the covariance matrix is:

{{6.333333333333333,7.5,7.0},{7.5,9.0,9.0},{7.0,9.0,12.0}}

The distances from {2,2,3}, {4,5,9}, and {7,8,9} to the mean vector all come out as 8290542, which is quite strange. Calculating by hand gives the same result.

Does anyone know what's wrong with my code or my reasoning? I'd be more than grateful if someone could help me out. Below is the code I used.

import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.stat.correlation.Covariance;
import org.apache.mahout.math.*;
import org.apache.mahout.common.distance.MahalanobisDistanceMeasure;

public class Test {

    public static void main(String[] args) {
        // The three sample points
        double[] a = {2, 2, 3};
        Vector aVector = new DenseVector(a);

        double[] b = {4, 5, 9};
        Vector bVector = new DenseVector(b);

        double[] c = {7, 8, 9};
        Vector cVector = new DenseVector(c);

        // 13.0 / 3 forces floating-point division; 13 / 3 is integer division and yields 4
        double[] mean = {13.0 / 3, 5, 7};
        Vector meanVector = new DenseVector(mean);

        MahalanobisDistanceMeasure measure = new MahalanobisDistanceMeasure();

        // Sample covariance of the data set (rows are observations), via Commons Math
        double[][] ma = {{2, 2, 3}, {4, 5, 9}, {7, 8, 9}};
        RealMatrix matrix = new Covariance(ma).getCovarianceMatrix();
        Matrix math = new DenseMatrix(matrix.getData());

        measure.setCovarianceMatrix(math);
        measure.setMeanVector(meanVector);

        System.out.println(matrix.toString());
        System.out.println(measure.distance(meanVector, cVector));
    }
}

Solution

  • You need to use more data.

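    Otherwise the mean vector and covariance matrix will overfit your data and give the same distance for every point. With only 3 points in 3 dimensions, the sample covariance matrix has rank at most 2, so it is singular and has no proper inverse.

    For 3D data, use at least 20 points.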
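
To see why three points are too few here, the following is a minimal sketch in plain Java (no Mahout or Commons Math; the class and helper names are made up for illustration). It recomputes the sample covariance of the question's data and its determinant. With n points the sample covariance has rank at most n - 1, so for 3 points in 3 dimensions the determinant should be zero up to rounding error.

```java
class SingularCovarianceDemo {

    // Sample covariance matrix (divides by n - 1); each row of data is one observation.
    static double[][] covariance(double[][] data) {
        int n = data.length;
        int d = data[0].length;

        // Column means
        double[] mean = new double[d];
        for (double[] row : data) {
            for (int j = 0; j < d; j++) {
                mean[j] += row[j] / n;
            }
        }

        // Accumulate products of deviations from the mean
        double[][] cov = new double[d][d];
        for (double[] row : data) {
            for (int j = 0; j < d; j++) {
                for (int k = 0; k < d; k++) {
                    cov[j][k] += (row[j] - mean[j]) * (row[k] - mean[k]) / (n - 1);
                }
            }
        }
        return cov;
    }

    // Determinant of a 3x3 matrix by cofactor expansion along the first row.
    static double det3(double[][] m) {
        return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
             - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
             + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
    }

    public static void main(String[] args) {
        // The data set from the question: 3 observations in 3 dimensions
        double[][] data = {{2, 2, 3}, {4, 5, 9}, {7, 8, 9}};
        double det = det3(covariance(data));

        // A determinant at or very near zero means the matrix is singular
        System.out.println("determinant = " + det);
    }
}
```

If the determinant is zero or very close to it, the inverse that the Mahalanobis distance relies on is not well defined, so whatever number comes out for the distance is not meaningful. With more points in general position the covariance matrix becomes full rank and the determinant moves away from zero.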