rmahalanobis

Calculating weigthed pairwise Mahalanobis distances


I know that HDMD package has a function called pairwise.mahalanobis that is supposed to calculate the pairwise Mahalanobis Distance. However, I also want to introduce weights to this distance and it is not feasible with this function. Therefore, I developed my own code. To test whether it functions well, I first kept it simple, i.e. with no weights, and compared its results to that of pairwise.mahalanobis function. However the results did not match... Below is the function I use:

dist.maha <- function (X) {
  diff = pair.diff(X) # pairwise difference of rows 
  V <- cov(X)  ## empirical covariance; positive definite
  L <- t(chol(V))  ## lower triangular factor
  stdX <- t(forwardsolve(L, t(diff)))  # solving the system of linear equations
  return(stdX)
}

And this is its implementation of both alternatives on a toy data:

data = as.matrix(c(100, 54, 56, 79, 12))
dist_manuel = dist.maha(data)

# This is to convert dist_manuel from a vector to a distance matrix 

ind_1 = vector(length = choose(nrow(data),2))
ind_2 = vector(length = choose(nrow(data),2))
k =1  
for (j in 1:(nrow(data)-1)){
  for(i in (j+1):nrow(data)){
    ind_1[k] = i
    ind_2[k] = j    
    k = k + 1
  }
}
dist_manuel = cbind(ind_1,ind_2,dist_manuel)
dist_mat = matrix(data = NA, nrow = nrow(data), ncol = nrow(data))

for (j in 1:(nrow(data)-1)){
  for(i in (j+1):nrow(data)){
    dist_mat[i,j] = dist_manuel[which(dist_manuel[,1] == i & dist_manuel[,2] == j),3]
  }
}

# This is the HDMD alternative 

id = c(1,2,3,4,5)
data = cbind(id,data) 
HDMD = pairwise.mahalanobis(as.data.frame(data[,2]), grouping = data[,1])
dist_HDMD = HDMD$distance

# The outputs

dist_HDMD

#     [,1] [,2] [,3] [,4] [,5]
#[1,]    0    1    4    9   16
#[2,]    1    0    1    4    9
#[3,]    4    1    0    1    4
#[4,]    9    4    1    0    1
#[5,]   16    9    4    1    0

dist_mat

#          [,1]        [,2]       [,3]     [,4] [,5]
#[1,]        NA          NA         NA       NA   NA
#[2,] 1.4002541          NA         NA       NA   NA
#[3,] 1.3393735 -0.06088061         NA       NA   NA
#[4,] 0.6392465 -0.76100768 -0.7001271       NA   NA
#[5,] 2.6787470  1.27849290  1.3393735 2.039501   NA

The results of pairwise.mahalanobis function seems completely absurd to me. For starters, it assigns a distance of 1 for both data[2] & data[3] and data[2] & data[1] which makes no sense when one looks at their values. My function, on the other hand, gives consistent results. For instance, let's compare the ratio of distances between data[1] & data[2] and data[1] & data[3].

(100 - 54) / (100 - 56) = 46/44 = 1.045455

Now, this ratio should hold for the distances my function produces as well.

dist_mat[2,1]/dist_mat[3,1]
#[1] 1.045455

And it does! Does that mean that my function works well while the pairwise.mahalanobis is erroneous ? ( or am I using it incorrectly somehow?) I am not very experienced in R, so I couldn't dare to come to this conclusion by myself. It would be great if someone more experienced than me could confirm my logic.


Solution

  • There is an error in your dist.maha function. This is obvious straight away because some of the distances it computes are negative numbers -- so they cannot be the actual distances! Fortunately, this is quite easy to fix as you simply need to square your stdX vector.

    library("HDMD")
    library("tidyverse")
    
    # Convert a vector of pairwise distances from to a distance matrix
    # (Simplified approach which doesn't use for-loops)
    pairwise_dist_to_dist_matrix <- function(dist, n) {
      stopifnot(length(dist) == n*(n-1)/2)
      mat <- matrix(NA_real_, n, n)
      diag(mat) <- 0
      mat[lower.tri(mat)] <- dist
      mat
    }
    
    dist.maha <- function (X) {
      diff <- pair.diff(X)                 # pairwise difference of rows
      V <- cov(X)                          # empirical covariance; positive definite
      L <- t(chol(V))                      # lower triangular factor
      stdX <- t(forwardsolve(L, t(diff)))  # solving the system of linear equations
      dist <- stdX * stdX                  # Don't forget to square!
      dist <- rowSums(dist)                # And add up the differences in each dimension.
      pairwise_dist_to_dist_matrix(dist, nrow(X))
    }
    
    # An alternative computation, for an additional check
    dist.maha2 <- function(X) {
      diff <- pair.diff(X)
      V <- cov(X)
      Vinv <- solve(V)
      dist <- rowSums(diff %*% Vinv * diff)
      pairwise_dist_to_dist_matrix(dist, nrow(X))
    }
    
    
    # Slightly more complex data matrix to check if
    # functions work in higher dimensions
    data <- matrix(c(100, 54, 56, 79, 12, 1, 2, 3, 4, 5), ncol = 2)
    dist.maha(data)
    #>          [,1]      [,2]     [,3]     [,4] [,5]
    #> [1,] 0.000000        NA       NA       NA   NA
    #> [2,] 2.275210 0.0000000       NA       NA   NA
    #> [3,] 1.974017 0.9742819 0.000000       NA   NA
    #> [4,] 4.759700 7.5842687 3.250906 0.000000   NA
    #> [5,] 7.896067 3.6213875 1.974017 5.690146    0
    dist.maha2(data)
    #>          [,1]      [,2]     [,3]     [,4] [,5]
    #> [1,] 0.000000        NA       NA       NA   NA
    #> [2,] 2.275210 0.0000000       NA       NA   NA
    #> [3,] 1.974017 0.9742819 0.000000       NA   NA
    #> [4,] 4.759700 7.5842687 3.250906 0.000000   NA
    #> [5,] 7.896067 3.6213875 1.974017 5.690146    0
    

    It also seems that you don't use pairwise.mahalanobis correctly. You have to compute and pass the covariance matrix (the cov argument).

    # This is the HDMD alternative
    id = c(1,2,3,4,5)
    
    # Incorrect:
    
    # You have to specify the `cov` argument.
    # Otherwise `pairwise.mahalanobis` doesn't compute it correctly
    # as each sample is assumed to be in its own group.
    
    pairwise.mahalanobis(data, grouping = id)$distance
    #>           [,1]      [,2]      [,3]     [,4]      [,5]
    #> [1,]     0.000 4345.2805 3840.7349  759.689 15362.940
    #> [2,]  4345.280    0.0000   16.7591 1487.209  3369.708
    #> [3,]  3840.735   16.7591    0.0000 1194.197  3840.735
    #> [4,]   759.689 1487.2090 1194.1967    0.000  9310.174
    #> [5,] 15362.940 3369.7076 3840.7349 9310.174     0.000
    
    # Correct:
    
    # NOTE: Ignore the warning message; there seems to be a small bug in `pairwise.mahalanobis`.
    pairwise.mahalanobis(data, grouping = id, cov = cov(data))$distance
    #> Warning in if (dim(cov) != c(p, p)) stop("cov matrix not of dim = (p,p)
    #> \n"): the condition has length > 1 and only the first element will be used
    #>          [,1]      [,2]      [,3]     [,4]     [,5]
    #> [1,] 0.000000 2.2752099 1.9740168 4.759700 7.896067
    #> [2,] 2.275210 0.0000000 0.9742819 7.584269 3.621388
    #> [3,] 1.974017 0.9742819 0.0000000 3.250906 1.974017
    #> [4,] 4.759700 7.5842687 3.2509059 0.000000 5.690146
    #> [5,] 7.896067 3.6213875 1.9740168 5.690146 0.000000
    

    Created on 2019-03-24 by the reprex package (v0.2.1)