I know that HDMD
package has a function called pairwise.mahalanobis
that is supposed to calculate the pairwise Mahalanobis Distance. However, I also want to introduce weights to this distance and it is not feasible with this function. Therefore, I developed my own code. To test whether it functions well, I first kept it simple, i.e. with no weights, and compared its results to that of pairwise.mahalanobis
function. However the results did not match... Below is the function I use:
dist.maha <- function (X) {
diff = pair.diff(X) # pairwise difference of rows
V <- cov(X) ## empirical covariance; positive definite
L <- t(chol(V)) ## lower triangular factor
stdX <- t(forwardsolve(L, t(diff))) # solving the system of linear equations
return(stdX)
}
And this is its implementation of both alternatives on a toy data:
data = as.matrix(c(100, 54, 56, 79, 12))
dist_manuel = dist.maha(data)
# This is to convert dist_manuel from a vector to a distance matrix
ind_1 = vector(length = choose(nrow(data),2))
ind_2 = vector(length = choose(nrow(data),2))
k =1
for (j in 1:(nrow(data)-1)){
for(i in (j+1):nrow(data)){
ind_1[k] = i
ind_2[k] = j
k = k + 1
}
}
dist_manuel = cbind(ind_1,ind_2,dist_manuel)
dist_mat = matrix(data = NA, nrow = nrow(data), ncol = nrow(data))
for (j in 1:(nrow(data)-1)){
for(i in (j+1):nrow(data)){
dist_mat[i,j] = dist_manuel[which(dist_manuel[,1] == i & dist_manuel[,2] == j),3]
}
}
# This is the HDMD alternative
id = c(1,2,3,4,5)
data = cbind(id,data)
HDMD = pairwise.mahalanobis(as.data.frame(data[,2]), grouping = data[,1])
dist_HDMD = HDMD$distance
# The outputs
dist_HDMD
# [,1] [,2] [,3] [,4] [,5]
#[1,] 0 1 4 9 16
#[2,] 1 0 1 4 9
#[3,] 4 1 0 1 4
#[4,] 9 4 1 0 1
#[5,] 16 9 4 1 0
dist_mat
# [,1] [,2] [,3] [,4] [,5]
#[1,] NA NA NA NA NA
#[2,] 1.4002541 NA NA NA NA
#[3,] 1.3393735 -0.06088061 NA NA NA
#[4,] 0.6392465 -0.76100768 -0.7001271 NA NA
#[5,] 2.6787470 1.27849290 1.3393735 2.039501 NA
The results of pairwise.mahalanobi
s function seems completely absurd to me. For starters, it assigns a distance of 1 for both data[2]
& data[3]
and data[2]
& data[1]
which makes no sense when one looks at their values. My function, on the other hand, gives consistent results. For instance, let's compare the ratio of distances between data[1]
& data[2]
and data[1]
& data[3]
.
(100 - 54) / (100 - 56) = 46/44 = 1.045455
Now, this ratio should hold for the distances my function produces as well.
dist_mat[2,1]/dist_mat[3,1]
#[1] 1.045455
And it does! Does that mean that my function works well while the pairwise.mahalanobis
is erroneous ? ( or am I using it incorrectly somehow?) I am not very experienced in R, so I couldn't dare to come to this conclusion by myself. It would be great if someone more experienced than me could confirm my logic.
There is an error in your dist.maha
function. This is obvious straight away because some of the distances it computes are negative numbers -- so they cannot be the actual distances! Fortunately, this is quite easy to fix as you simply need to square your stdX
vector.
library("HDMD")
library("tidyverse")
# Convert a vector of pairwise distances from to a distance matrix
# (Simplified approach which doesn't use for-loops)
pairwise_dist_to_dist_matrix <- function(dist, n) {
stopifnot(length(dist) == n*(n-1)/2)
mat <- matrix(NA_real_, n, n)
diag(mat) <- 0
mat[lower.tri(mat)] <- dist
mat
}
dist.maha <- function (X) {
diff <- pair.diff(X) # pairwise difference of rows
V <- cov(X) # empirical covariance; positive definite
L <- t(chol(V)) # lower triangular factor
stdX <- t(forwardsolve(L, t(diff))) # solving the system of linear equations
dist <- stdX * stdX # Don't forget to square!
dist <- rowSums(dist) # And add up the differences in each dimension.
pairwise_dist_to_dist_matrix(dist, nrow(X))
}
# An alternative computation, for an additional check
dist.maha2 <- function(X) {
diff <- pair.diff(X)
V <- cov(X)
Vinv <- solve(V)
dist <- rowSums(diff %*% Vinv * diff)
pairwise_dist_to_dist_matrix(dist, nrow(X))
}
# Slightly more complex data matrix to check if
# functions work in higher dimensions
data <- matrix(c(100, 54, 56, 79, 12, 1, 2, 3, 4, 5), ncol = 2)
dist.maha(data)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.000000 NA NA NA NA
#> [2,] 2.275210 0.0000000 NA NA NA
#> [3,] 1.974017 0.9742819 0.000000 NA NA
#> [4,] 4.759700 7.5842687 3.250906 0.000000 NA
#> [5,] 7.896067 3.6213875 1.974017 5.690146 0
dist.maha2(data)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.000000 NA NA NA NA
#> [2,] 2.275210 0.0000000 NA NA NA
#> [3,] 1.974017 0.9742819 0.000000 NA NA
#> [4,] 4.759700 7.5842687 3.250906 0.000000 NA
#> [5,] 7.896067 3.6213875 1.974017 5.690146 0
It also seems that you don't use pairwise.mahalanobis
correctly. You have to compute and pass the covariance matrix (the cov
argument).
# This is the HDMD alternative
id = c(1,2,3,4,5)
# Incorrect:
# You have to specify the `cov` argument.
# Otherwise `pairwise.mahalanobis` doesn't compute it correctly
# as each sample is assumed to be in its own group.
pairwise.mahalanobis(data, grouping = id)$distance
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.000 4345.2805 3840.7349 759.689 15362.940
#> [2,] 4345.280 0.0000 16.7591 1487.209 3369.708
#> [3,] 3840.735 16.7591 0.0000 1194.197 3840.735
#> [4,] 759.689 1487.2090 1194.1967 0.000 9310.174
#> [5,] 15362.940 3369.7076 3840.7349 9310.174 0.000
# Correct:
# NOTE: Ignore the warning message; there seems to be a small bug in `pairwise.mahalanobis`.
pairwise.mahalanobis(data, grouping = id, cov = cov(data))$distance
#> Warning in if (dim(cov) != c(p, p)) stop("cov matrix not of dim = (p,p)
#> \n"): the condition has length > 1 and only the first element will be used
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.000000 2.2752099 1.9740168 4.759700 7.896067
#> [2,] 2.275210 0.0000000 0.9742819 7.584269 3.621388
#> [3,] 1.974017 0.9742819 0.0000000 3.250906 1.974017
#> [4,] 4.759700 7.5842687 3.2509059 0.000000 5.690146
#> [5,] 7.896067 3.6213875 1.9740168 5.690146 0.000000
Created on 2019-03-24 by the reprex package (v0.2.1)