rdistance

Mahalanobis distance in R for more than 2 groups


I need to calculate the Mahalanobis distance for a numerical dataset of 500 independent observations grouped into 12 groups (species). I know how to compare two matrices, but I do not understand how to calculate the Mahalanobis distance from my dataset, i.e. between the 12 species. The R documentation gives:

mahalanobis(x, center, cov, inverted = FALSE, ...)

Here x is the data matrix and cov is its covariance matrix (cov(x)), but I do not understand how to calculate the metric for the 12 groups.

I found this question on the Mahalanobis distance, but it does not really answer my question.


Solution

  • Getting the distances is straightforward if you organize your data in a 500 by 12 data.frame or matrix. To illustrate, first create a data.frame with some toy data:

    set.seed(1) # To ensure reproducibility of the random numbers
    df <- data.frame(sapply(LETTERS[1:12], function(x) rnorm(500)))
    # Adding some outliers
    df[1,1] <- 20
    df[200,5] <- 60
    head(df)
    #            A           B           C          D           E          F          G          H
    # 1 20.0000000  0.07730312  1.13496509  0.8500435 -0.88614959 -1.8054836  0.7391149  0.5205997
    # 2  0.1836433 -0.29686864  1.11193185 -0.9253130 -1.92225490 -0.6780407  0.3866087  0.3775619
    # 3 -0.8356286 -1.18324224 -0.87077763  0.8935812  1.61970074 -0.4733581  1.2963972 -0.6236588
    # 4  1.5952808  0.01129269  0.21073159 -0.9410097  0.51926990  1.0274171 -0.8035584 -0.5726105
    # 5  0.3295078  0.99160104  0.06939565  0.5389521 -0.05584993 -0.5973876 -1.6026257  0.3125012
    # 6 -0.8204684  1.59396745 -1.66264885 -0.1819744  0.69641761  1.1598494  0.9332510 -0.7074278
    #            I          J          K          L
    # 1 -1.1346302  1.5579537 -1.5163733 -1.1378698
    # 2  0.7645571 -0.7292970  0.6291412 -0.9518105
    # 3  0.5707101 -1.5039509 -1.6781940  1.6192595
    # 4 -1.3516939 -0.5667870  1.1797811  0.1678136
    # 5 -2.0298855 -2.1044536  1.1176545 -0.9081778
    # 6  0.5904787  0.5307319 -1.2377359  1.3417959
    

    Here the 12 columns, A to L, are the species. With the data organized this way, you simply run the following line:

    dist.sq <- mahalanobis(x = df, center = colMeans(df), cov = cov(df))
    

    Remember that the function returns the squared distances, one per row, so take the square root before plotting:

    plot(sqrt(dist.sq))
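
To make clear what the call above computes, here is a minimal sketch that reproduces it by hand. For each row x, with column means mu and covariance matrix S, the squared distance is (x - mu)' S^-1 (x - mu). The names `X`, `Xc`, `d2_builtin` and `d2_manual` are illustrative only and do not appear in the original answer.

```r
set.seed(1)
df <- data.frame(sapply(LETTERS[1:12], function(x) rnorm(500)))

X  <- as.matrix(df)
mu <- colMeans(X)   # centroid of the whole data set
S  <- cov(X)        # 12 x 12 covariance matrix

# Built-in: squared distance of every row from the centroid
d2_builtin <- mahalanobis(X, center = mu, cov = S)

# By hand: (x - mu)' S^-1 (x - mu), evaluated for all rows at once
Xc <- sweep(X, 2, mu)                          # subtract the column means
d2_manual <- rowSums((Xc %*% solve(S)) * Xc)

# Should report TRUE if the two computations agree
all.equal(unname(d2_builtin), unname(d2_manual))
```

The result has one value per row (500 in total), i.e. the distance of each observation from the overall centroid, not a distance between species.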
    

    I hope this helps.
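
If the data are instead laid out with one row per observation and a separate species label, which is how the question seems to describe them, distances between the 12 species can be sketched as below. This is an assumption about the layout, not part of the answer above: the trait columns (`trait1` to `trait4`) and the random species labels are hypothetical toy data, and using the pooled within-species covariance is one common choice rather than the only one.

```r
set.seed(1)
n <- 500

# Hypothetical toy data: 4 numeric traits and a species label per observation
species <- factor(sample(LETTERS[1:12], n, replace = TRUE))
traits  <- matrix(rnorm(n * 4), ncol = 4,
                  dimnames = list(NULL, paste0("trait", 1:4)))

# Species centroids: one row per species, one column per trait
centroids <- t(sapply(split(as.data.frame(traits), species), colMeans))

# Pooled within-species covariance (assumed common to all species)
resid  <- traits - centroids[as.character(species), ]
pooled <- crossprod(resid) / (n - nlevels(species))

# Squared Mahalanobis distance between every pair of centroids
k  <- nrow(centroids)
D2 <- sapply(seq_len(k), function(i)
  mahalanobis(centroids, center = centroids[i, ], cov = pooled))
dimnames(D2) <- list(rownames(centroids), rownames(centroids))

D <- sqrt(D2)          # 12 x 12 symmetric matrix with a zero diagonal
round(D[1:3, 1:3], 2)
```

Each entry `D[i, j]` is the distance between the centroids of species i and j. `as.dist(D)` turns the matrix into a `dist` object that can be passed to functions such as `hclust()`.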