rclassificationcluster-analysismahalanobispairwise-distance

Pairwise Mahalanobis Distance Between Multiple Independent Variables and a Dependent Variable With Three Classes in R


Issue

I'm having some trouble calculating the Mahalanobis Distance between three classes of the independent variable Country.

My aim is to calculate the Mahalanobis distance among dolphin whistle acoustic parameters measured from a spectrogram taken from three populations in three different study regions. I want to calculate potential (dis)similarity amongst whistle types.

Dataframe structure

'data.frame':   367 obs. of  10 variables:
 $ Country    : Factor w/ 3 levels "Holland","France",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Low.Freq   : num  -0.245 -1.846 -1.369 -2.022 2.045 ...
 $ High.Freq  : num  -0.684 -2.033 -2.128 -2.643 0.358 ...
 $ Peak.Freq  : num  -0.0879 -1.3636 -0.8557 -1.4291 0.2001 ...
 $ Delta.Freq : num  -0.384 -0.588 -0.922 -0.973 -0.664 ...
 $ Delta.Time : num  -1.48 -1.124 -0.861 -0.999 -1.696 ...
 $ Peak.Time  : num  0.0383 0.3698 0.3703 -0.7444 -0.3214 ...
 $ Center.Freq: num  -0.169 -1.157 -1.032 -1.556 0.16 ...
 $ Start.Freq : num  -0.741 -0.944 -1.149 -1.712 0.905 ...
 $ End.Freq   : num  -0.335 -1.495 -1.561 -2.242 0.704 ...

I found a StackOverflow question that covered exactly what I want to achieve here but the problem is they conduct a pairwise mahalanobis distance analysis using the function pairwise.mahalanobis() from the HDMD package. However, when I try and download the package, I get this message.

Warning Message

Warning in install.packages :
  package ‘HDMD’ is not available for this version of R
A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages

#Ive tried these different methods to find the package but nothing works

ap <- available.packages()
View(ap)
"HDMD" %in% rownames(ap)

install.packages("HDMD", dependencies = TRUE)

av <- available.packages(filters=list())
av[av[, "Package"] == HDMD, ]

install.packages('HDMD',repos='http://cran.us.r-project.org')

I can't seem to find another function that behaves in the same way as the function pairwise.mahalanobis() except for a loop on this StackOverflow question, here, which I also could not get to work.

Does anyone know how to achieve the type of analysis I'm attempting because I can't figure out how?

If anyone can lend a hand, I would be very grateful.

R-CODE

    library(HDMD) #pairwise.mahalanobis function
    library(cluster) #agnes function
    library(biotools)

    group = matrix(Cluster_Dummy_2$Country) #what is being compared
    group = t(group[,1]) #prepare for pairwise.mahalanobis function
    
    variables = c(variables = c("Low.Freq", "High.Freq", "Peak.Freq", "Delta.Freq", "Delta.Time", "Peak.Time", 
                                "Center.Freq", "Start.Freq","End.Freq")) #variables (what is being used for comparison)
    variables = as.matrix(Cluster_Dummy_2[,variables]) #prepare for pairwise.mahalanobis function

The original code has the function pairwise.mahalanobis() in the HDMD package:

mahala_sq = pairwise.mahalanobis(x=variables, grouping=group) #get squared mahalanobis distances (see mahala_sq$distance).

I decided to try D2.dist() using the biotools package

mahala_sq = D2.dist(variables, group) #get squared mahalanobis distances (see mahala_sq$distance).
names = rownames(mahala_sq$means) #capture labels

#Error 
Error in D2.dist(variables, group, inverted = FALSE) : 
  incompatible dimensions!

Rest of the code to produce a dendrogram

mahala = sqrt(mahala_sq$distance) #mahalanobis distance
rownames(mahala) = names #set rownames in the dissimilarity matrix
colnames(mahala) = names #set colnames in the dissimilarity matrix

#This is how I used the dissimilarity matrix to find clusters.
cluster = agnes(mahala,diss=TRUE,keep.diss=FALSE,method="complete") #hierarchical clustering
plot(cluster,which.plots=2) #plot dendrogram

Data frame

structure(list(Low.Freq = c(435L, 94103292L, 1L, 2688L, 8471L, 
    28818L, 654755585L, 468628164L, 342491L, 2288474L, 3915L, 411L, 
    267864894L, 3312618L, 5383L, 8989443L, 1894L, 534981L, 9544861L, 
    3437614L, 475386L, 7550764L, 48744L, 2317845L, 5126197L, 2445L, 
    8L, 557450L, 450259742L, 21006647L, 9L, 7234027L, 59L, 9L, 605L, 
    9199L, 3022L, 30218156L, 46423L, 38L, 88L, 396396244L, 28934316L, 
    7723L, 95688045L, 679354L, 716352L, 76289L, 332826763L, 6L, 90975L, 
    83103577L, 9529L, 229093L, 42810L, 5L, 18175302L, 1443751L, 5831L, 
    8303661L, 86L, 778L, 23947L, 8L, 9829740L, 2075838L, 7434328L, 
    82174987L, 2L, 94037071L, 9638653L, 5L, 3L, 65972L, 0L, 936779338L, 
    4885076L, 745L, 8L, 56456L, 125140L, 73043989L, 516476L, 7L, 
    4440739L, 612L, 3966L, 8L, 9255L, 84127L, 96218L, 5690L, 56L, 
    3561L, 78738L, 1803363L, 809369L, 7131L, 0L), High.Freq = c(6071L, 
    3210L, 6L, 7306092L, 6919054L, 666399L, 78L, 523880161L, 4700783L, 
    4173830L, 30L, 811L, 341014L, 780L, 44749L, 91L, 201620707L, 
    74L, 1L, 65422L, 595L, 89093186L, 946520L, 6940919L, 655350L, 
    4L, 6L, 618L, 2006697L, 889L, 1398L, 28769L, 90519642L, 984L, 
    0L, 296209525L, 487088392L, 5L, 894L, 529L, 5L, 99106L, 2L, 926017L, 
    9078L, 1L, 21L, 88601017L, 575770L, 48L, 8431L, 194L, 62324996L, 
    5L, 81L, 40634727L, 806901520L, 6818173L, 3501L, 91780L, 36106039L, 
    5834347L, 58388837L, 34L, 3280L, 6507606L, 19L, 402L, 584L, 76L, 
    4078684L, 199L, 6881L, 92251L, 81715L, 40L, 327L, 57764L, 97668898L, 
    2676483L, 76L, 4694L, 817120L, 51L, 116712L, 666L, 3L, 42841L, 
    9724L, 21L, 4L, 359L, 2604L, 22L, 30490L, 5640L, 34L, 51923625L, 
    35544L), Peak.Freq = c(87005561L, 9102L, 994839015L, 42745869L, 
    32840L, 62737133L, 2722L, 24L, 67404881L, 999242982L, 3048L, 
    85315406L, 703037627L, 331264L, 8403609L, 3934064L, 50578953L, 
    370110665L, 3414L, 12657L, 40L, 432L, 7707L, 214L, 68588962L, 
    69467L, 75L, 500297L, 704L, 1L, 102659072L, 60896923L, 4481230L, 
    94124925L, 60164619L, 447L, 580L, 8L, 172L, 9478521L, 20L, 53L, 
    3072127L, 2160L, 27301893L, 8L, 4263L, 508L, 712409L, 50677L, 
    522433683L, 112844L, 193385L, 458269L, 93578705L, 22093131L, 
    6L, 9L, 1690461L, 0L, 4L, 652847L, 44767L, 21408L, 5384L, 304L, 
    721L, 651147L, 2426L, 586L, 498289375L, 945L, 6L, 816L, 46207L, 
    39135L, 6621028L, 66905L, 26905085L, 4098L, 0L, 14L, 88L, 530L, 
    97809006L, 90L, 6L, 260792844L, 9L, 833205723L, 99467321L, 5L, 
    8455640L, 54090L, 2L, 309L, 299161148L, 4952L, 454824L), Delta.Freq = c(5L, 
    78L, 88553L, 794L, 5L, 3859122L, 782L, 36L, 8756801L, 243169338L, 
    817789L, 8792384L, 7431L, 626921743L, 9206L, 95789L, 7916L, 8143453L, 
    6L, 4L, 6363L, 181125L, 259618L, 6751L, 33L, 37960L, 0L, 2L, 
    599582228L, 565585L, 19L, 48L, 269450424L, 70676581L, 7830566L, 
    4L, 86484313L, 21L, 90899794L, 2L, 72356L, 574280L, 869544L, 
    73418L, 6468164L, 2259L, 5938505L, 31329L, 1249L, 354L, 8817L, 
    3L, 2568L, 82809L, 29836269L, 5230L, 37L, 33752014L, 79307L, 
    1736L, 8522076L, 40L, 2289135L, 862L, 801448L, 8026L, 5L, 15L, 
    4393771L, 405914L, 71098L, 950288L, 8319L, 1396973L, 832L, 70L, 
    1746L, 61907L, 8709547L, 300750537L, 45862L, 91417085L, 79892L, 
    47765L, 5477L, 18L, 4186L, 2860L, 754038591L, 375L, 53809223L, 
    72L, 136L, 509L, 232325L, 13128104L, 1692L, 8581L, 23L), Delta.Time = c(1361082L, 
    7926L, 499L, 5004L, 3494530L, 213L, 64551179L, 70L, 797L, 5L, 
    72588L, 86976L, 5163L, 635080L, 3L, 91L, 919806257L, 81443L, 
    3135427L, 4410972L, 5810L, 8L, 46603718L, 422L, 1083626L, 48L, 
    15699890L, 7L, 90167635L, 446459879L, 2332071L, 761660L, 49218442L, 
    381L, 46L, 493197L, 46L, 798597155L, 45342274L, 6265842L, 6L, 
    3445819L, 351L, 1761227L, 214L, 959L, 908996387L, 6L, 3855L, 
    9096604L, 152664L, 7970052L, 32366926L, 31L, 5201618L, 114L, 
    7806411L, 70L, 239L, 5065L, 2L, 1L, 14472831L, 122042249L, 8L, 
    495604L, 29L, 8965478L, 2875L, 959L, 39L, 9L, 690L, 933626665L, 
    85294L, 580093L, 95934L, 982058L, 65244056L, 137508L, 29L, 7621L, 
    7527L, 72L, 2L, 315L, 6L, 2413L, 8625150L, 51298109L, 851L, 890460L, 
    160736L, 6L, 850842734L, 2L, 7L, 76969113L, 190536L), Peak.Time = c(1465265L, 
    452894L, 545076172L, 8226275L, 5040875L, 700530L, 1L, 3639L, 
    20141L, 71712131L, 686L, 923L, 770569738L, 69961L, 737458636L, 
    122403L, 199502046L, 6108L, 907L, 108078263L, 7817L, 4L, 6L, 
    69L, 721L, 786353L, 87486L, 1563L, 876L, 47599535L, 79295722L, 
    53L, 7378L, 591L, 6607935L, 954L, 6295L, 75514344L, 5742050L, 
    25647276L, 449L, 328566184L, 4L, 2L, 2703L, 21367543L, 63429043L, 
    708L, 782L, 909820L, 478L, 50L, 922L, 579882L, 7850L, 534L, 2157492L, 
    96L, 6L, 716L, 5L, 653290336L, 447854237L, 2L, 31972263L, 645L, 
    7L, 609909L, 4054695L, 455631L, 4919894L, 9L, 72713L, 9997L, 
    84090765L, 89742L, 5L, 5028L, 4126L, 23091L, 81L, 239635020L, 
    3576L, 898597785L, 6822L, 3798L, 201999L, 19624L, 20432923L, 
    18944093L, 930720236L, 1492302L, 300122L, 143633L, 5152743L, 
    417344L, 813L, 55792L, 78L), Center_Freq = c(61907L, 8709547L, 
    300750537L, 45862L, 91417085L, 79892L, 47765L, 5477L, 18L, 4186L, 
    2860L, 754038591L, 375L, 53809223L, 72L, 136L, 4700783L, 4173830L, 
    30L, 811L, 341014L, 780L, 44749L, 91L, 201620707L, 74L, 1L, 65422L, 
    595L, 89093186L, 946520L, 6940919L, 48744L, 2317845L, 5126197L, 
    2445L, 8L, 557450L, 450259742L, 21006647L, 9L, 7234027L, 59L, 
    9L, 651547554L, 45554L, 38493L, 91055218L, 38L, 1116474L, 2295482L, 
    3001L, 9L, 3270L, 141L, 53644L, 667983L, 565598L, 84L, 971L, 
    555498297L, 60431L, 6597L, 856943893L, 607815536L, 4406L, 79L, 
    4885076L, 745L, 8L, 56456L, 125140L, 73043989L, 516476L, 7L, 
    4440739L, 754038591L, 375L, 53809223L, 72L, 136L, 509L, 232325L, 
    13128104L, 1692L, 8581L, 23L, 5874213L, 4550L, 644668065L, 3712371L, 
    5928L, 8833L, 7L, 2186023L, 61627221L, 37297L, 716427989L, 21387L
    ), Start.Freq = c(426355L, 22073538L, 680374L, 41771L, 54L, 6762844L, 
    599171L, 108L, 257451851L, 438814L, 343045L, 4702L, 967787L, 
    1937L, 18L, 89301735L, 366L, 90L, 954L, 7337732L, 70891703L, 
    4139L, 10397931L, 940000382L, 7L, 38376L, 878528819L, 6287L, 
    738366L, 31L, 47L, 5L, 6L, 77848L, 2366508L, 45L, 3665842L, 7252260L, 
    6L, 61L, 3247L, 448348L, 1L, 705132L, 144L, 7423637L, 2L, 497L, 
    844927639L, 78978L, 914L, 131L, 7089563L, 927L, 9595581L, 2774463L, 
    1651L, 73509280L, 7L, 35L, 18L, 96L, 1L, 92545512L, 27354947L, 
    7556L, 65019L, 7480L, 71835L, 8249L, 64792L, 71537L, 349389666L, 
    280244484L, 82L, 6L, 40L, 353872L, 0L, 103L, 1255L, 4752L, 29L, 
    76L, 81185L, 14L, 9L, 470775630L, 818361265L, 57947209L, 44L, 
    24L, 41295L, 4L, 261449L, 9931404L, 773556640L, 930717L, 65007421L
    ), End.Freq = c(71000996L, 11613579L, 71377155L, 1942738L, 8760748L, 
    79L, 455L, 374L, 8L, 5L, 2266932L, 597833L, 155488L, 3020L, 4L, 
    554L, 4L, 16472L, 1945649L, 668181101L, 649780L, 22394365L, 93060602L, 
    172146L, 20472L, 23558847L, 190513L, 22759044L, 44L, 78450L, 
    205621181L, 218L, 69916344L, 23884L, 66L, 312148L, 7710564L, 
    4L, 422L, 744572L, 651547554L, 45554L, 38493L, 91055218L, 38L, 
    1116474L, 2295482L, 3001L, 9L, 3270L, 141L, 55595L, 38451L, 8660867L, 
    14L, 96L, 345L, 6L, 44L, 8235824L, 910517L, 1424326L, 87102566L, 
    53644L, 667983L, 565598L, 84L, 971L, 555498297L, 60431L, 6597L, 
    856943893L, 607815536L, 4406L, 79L, 7L, 28978746L, 7537295L, 
    6L, 633L, 345860066L, 802L, 1035131L, 602L, 2740L, 8065L, 61370968L, 
    429953765L, 981507L, 8105L, 343787257L, 44782L, 64184L, 12981359L, 
    123367978L, 818775L, 123745614L, 25345654L, 3L), Country = c("Holland", 
    "Holland", "Holland", "Holland", "Holland", "Holland", "Spain", 
    "Spain", "Spain", "Spain", "Spain", "Spain", "Spain", "Spain", 
    "Spain", "Spain", "Spain", "Spain", "Holland", "Holland", "Holland", 
    "Holland", "Holland", "Holland", "France", "France", "France", 
    "France", "France", "France", "France", "France", "France", "France", 
    "France", "Spain", "Spain", "Spain", "Spain", "Spain", "Spain", 
    "Spain", "Spain", "France", "France", "France", "France", "Holland", 
    "Holland", "Holland", "Holland", "Holland", "Holland", "Holland", 
    "Holland", "Holland", "Holland", "Holland", "Holland", "Holland", 
    "Spain", "Spain", "Spain", "Spain", "Spain", "Spain", "Spain", 
    "Holland", "Holland", "Holland", "Holland", "France", "France", 
    "France", "France", "France", "France", "France", "Spain", "Spain", 
    "Spain", "Spain", "Spain", "Spain", "Spain", "Spain", "Spain", 
    "Spain", "Spain", "Spain", "Spain", "Spain", "Spain", "Spain", 
    "Spain", "Spain", "France", "France", "France")), row.names = c(NA, 
    99L), class = "data.frame")

Solution

  • Code to download the function pairwise.mahalanobis() from the HDMD package

    #install.packages("biotools")
    #install.packages("cluster")
    
    library(biotools) #pairwise.mahalanobis function
    library(cluster) #agnes function
    
    group = matrix(Cluster_Dummy_2$Country) #what is being compared
    group = t(group[,1]) #prepare for pairwise.mahalanobis function
    
    variables = c(variables = c("Low.Freq", "High.Freq", "Peak.Freq", "Delta.Freq", "Delta.Time", "Peak.Time", 
                                "Center.Freq", "Start.Freq","End.Freq")) #variables (what is being used for comparison)
    
    variables = as.matrix(Cluster_Dummy_2[,variables]) #prepare for pairwise.mahalanobis function
    variables
    
    #function for calculating the Mahalanobis distance
    
    #REF:https://cran.r-project.org/src/contrib/Archive/HDMD/
    
    #HDMD Package download
    
    pairwise.mahalanobis = 
      function(x, grouping=NULL, cov =NULL, inverted=FALSE, digits = 5, ...) 
      {
        x <- if (is.vector(x))                                  #standardize input data as matrix
          matrix(x, ncol = length(x))
        else as.matrix(x)
        
        if(!is.matrix(x))
          stop("x could not be forced into a matrix")
        
        if(length(grouping) ==0){                               #no group assigned, uses first col
          grouping = t(x[1])
          x = x[2:dim(x)[2]]    
          cat("assigning grouping\n")
          print(grouping)
        }
        
        n <- nrow(x)
        p <- ncol(x)
        
        if (n != length(grouping)){                                 #grouping and matrix do not correspond
          cat(paste("n: ", n, "and groups: ", length(grouping), "\n"))
          stop("nrow(x) and length(grouping) are different")
        }
        g <- as.factor(grouping)
        g
        lev <- lev1 <- levels(g)                                # Groups
        counts <- as.vector(table(g))                           # No. of elements in each group
        
        if (any(counts == 0)) {                             # Remove grouping if not represented in data
          empty <- lev[counts == 0]
          warning(sprintf(ngettext(length(empty), "group %s is empty", 
                                   "groups %s are empty"), paste(empty, collapse = " ")), 
                  domain = NA)
          lev1 <- lev[counts > 0]
          g <- factor(g, levels = lev1)
          counts <- as.vector(table(g))
        }
        
        ng = length(lev1)
        
        group.means <- tapply(x, list(rep(g, p), col(x)), mean)     # g x p matrix of group means from x
        
        if(missing(cov)){                                           #create covariance matrix
          inverted = FALSE
          cov = cor(x)                                          #standardize into correlation mtx
        }
        else{                                                       #check cov of correct dimension
          if(dim(cov) != c(p,p))
            stop("cov matrix not of dim = (p,p)\n")
        }
        
        Distance = matrix(nrow=ng, ncol=ng)                                 #initialize distance matrix 
        dimnames(Distance) = list(names(group.means), names(group.means))
        
        Means = round(group.means, digits)
        Cov = round(cov, digits)
        Distance = round(Distance, digits)
        
        for(i in 1:ng){
          Distance[i,]= mahalanobis(group.means, group.means[i,], cov, inverted)
        }
        
        result <- list(means = group.means, cov = cov, distance = Distance)
        result
      }
    
    Calculate the pairwise Mahalanobis distance
    
    #Calculate the Mahalanobis distance
    mahala_sq = pairwise.mahalanobis(x=variables, grouping=group) #get squared mahalanobis distances (see mahala_sq$distance).
    mahala_sq
    
    names = rownames(mahala_sq$means) #capture labels
    
    mahala = sqrt(mahala_sq$distance) #mahalanobis distance
    rownames(mahala) = names #set rownames in the dissimilarity matrix
    colnames(mahala) = names #set colnames in the dissimilarity matrix
    
    mahala 
    
    #This is how I used the dissimilarity matrix to find clusters.
    cluster = agnes(mahala,diss=TRUE,keep.diss=FALSE,method="complete") #hierarchical clustering
    plot(cluster,which.plots=2) #plot dendrogram