rarraysmatrixmatchaverage

create an average distance matrix based on matching prefix column and row names in R


I want to create a new distance matrix from an array of hundreds of distance matrices. This new matrix will be the average distances given a set of pairwise distances. The values to be averaged is based matching name prefixes. Easier to illustrate as my language might be incorrect: Here's one sample distance matrix:

                    SpeciesA_1  SpeciesA_2   SpeciesC_1.  SpeciesC_2
SpeciesA_1      0       0.2         0.3     0.4

SpeciesA_2      0.2     0           0.4     0.5

SpeciesC_1      0.3     0.4         0       0.1

SpeciesC_2      0.4     0.5         0.1     0

I want the mean of all possible pairwise combinations given two prefix. For example:

SpeciesA_1 & SpeciesC_1 (0.3)
SpeciesA_1 & SpeciesC_2 (0.4)
SpeciesA_2 & SpeciesC_1 (0.4)
SpeciesA_2 & SpeciesC_2 (0.5)

Therefore, my resulting new distance matrix in this example should be:

           SpeciesA     SpeciesC
SpeciesA    0          0.4

SpeciesC    0.4         0

My actual matrices have about 140 columns and rows. I have no idea where to even get started with this. These dists are an array:

dists[,,i]

I can't use a loop (which I would normally start with by selecting the first name) because it is an array which I never work with.

dim(dists)
[1] 135 135 138

Here's an example of how I initially made the array of matrices and then filled them in:

labs <- c("A_1","A_2", "B_1", "B_2", "C_1", "C_3", "D_1", "D_3")

dists <- array(0, dim= c(length(labs),length(labs), length(tr)),dimnames=list(labs, labs))```


For testing, I've reduced my dataset to a smaller version for reproducibility and providing it here via dput: ```structure(c(0, 0.0636415002701785, 0.145334597498582, 0.163803183404822, 
0.163803183404822, 0.163803183404822, 0.181754323491637, 2, 2, 
2, 2, 0.0636415002701785, 0, 0.145334597498582, 0.163803183404822, 
0.163803183404822, 0.163803183404822, 0.181754323491637, 2, 2, 
2, 2, 0.145334597498582, 0.145334597498582, 0, 0.163803183404822, 
0.163803183404822, 0.163803183404822, 0.181754323491637, 2, 2, 
2, 2, 0.163803183404822, 0.163803183404822, 0.163803183404822, 
0, 0.054577752568928, 0.109103869804459, 0.181754323491637, 2, 
2, 2, 2, 0.163803183404822, 0.163803183404822, 0.163803183404822, 
0.054577752568928, 0, 0.109103869804459, 0.181754323491637, 2, 
2, 2, 2, 0.163803183404822, 0.163803183404822, 0.163803183404822, 
0.109103869804459, 0.109103869804459, 0, 0.181754323491637, 2, 
2, 2, 2, 0.181754323491637, 0.181754323491637, 0.181754323491637, 
0.181754323491637, 0.181754323491637, 0.181754323491637, 0, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 1.33332929092241, 1.33332929092241, 
1.33332929092241, 2, 2, 2, 2, 2, 2, 2, 1.33332929092241, 0, 0.222274233863968, 
0.444424503431041, 2, 2, 2, 2, 2, 2, 2, 1.33332929092241, 0.222274233863968, 
0, 0.444424503431041, 2, 2, 2, 2, 2, 2, 2, 1.33332929092241, 
0.444424503431041, 0.444424503431041, 0, 0, 0.0938002031509043, 
0.0938002031509043, 0, 0.0938002031509043, 0.167066002969562, 
0.167066002969562, 0.445943812443056, 2, 2, 2, 0.0938002031509043, 
0, 0.0317248785850623, 0, 0.0663652495301367, 0.167066002969562, 
0.167066002969562, 0.445943812443056, 2, 2, 2, 0.0938002031509043, 
0.0317248785850623, 0, 0, 0.0663652495301367, 0.167066002969562, 
0.167066002969562, 0.445943812443056, 2, 2, 2, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0.0938002031509043, 0.0663652495301367, 0.0663652495301367, 
0, 0, 0.167066002969562, 0.167066002969562, 0.445943812443056, 
2, 2, 2, 0.167066002969562, 0.167066002969562, 0.167066002969562, 
0, 0.167066002969562, 0, 0.0896758431063197, 0.445943812443056, 
2, 2, 2, 0.167066002969562, 0.167066002969562, 0.167066002969562, 
0, 0.167066002969562, 0.0896758431063197, 0, 0.445943812443056, 
2, 2, 2, 0.445943812443056, 0.445943812443056, 0.445943812443056, 
0, 0.445943812443056, 0.445943812443056, 0.445943812443056, 0, 
2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 
2, 2, 0, 1.49478622213262, 2, 2, 2, 0, 2, 2, 2, 2, 2, 1.49478622213262, 
0, 0, 0.16633562218756, 0.739785630571588, 0.739785630571588, 
0.739785630571589, 0, 0.739785630571589, 0, 0.739785630571589, 
1.33468159813059, 1.33468159813059, 0.16633562218756, 0, 0.739785630571588, 
0.739785630571588, 0.739785630571589, 0, 0.739785630571589, 0, 
0.739785630571589, 1.33468159813059, 1.33468159813059, 0.739785630571588, 
0.739785630571588, 0, 0.246730997741163, 0.493610100538917, 0, 
0.493610100538917, 0, 0.493610100538917, 1.33468159813059, 1.33468159813059, 
0.739785630571588, 0.739785630571588, 0.246730997741163, 0, 0.493610100538917, 
0, 0.493610100538917, 0, 0.493610100538917, 1.33468159813059, 
1.33468159813059, 0.739785630571589, 0.739785630571589, 0.493610100538917, 
0.493610100538917, 0, 0, 0.165591599845893, 0, 0.329233335343189, 
1.33468159813059, 1.33468159813059, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0.739785630571589, 0.739785630571589, 0.493610100538917, 
0.493610100538917, 0.165591599845893, 0, 0, 0, 0.329233335343189, 
1.33468159813059, 1.33468159813059, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0.739785630571589, 0.739785630571589, 0.493610100538917, 
0.493610100538917, 0.329233335343189, 0, 0.329233335343189, 0, 
0, 1.33468159813059, 1.33468159813059, 1.33468159813059, 1.33468159813059, 
1.33468159813059, 1.33468159813059, 1.33468159813059, 0, 1.33468159813059, 
0, 1.33468159813059, 0, 0.999452440298663, 1.33468159813059, 
1.33468159813059, 1.33468159813059, 1.33468159813059, 1.33468159813059, 
0, 1.33468159813059, 0, 1.33468159813059, 0.999452440298663, 
0), .Dim = c(11L, 11L, 3L), .Dimnames = list(c("Az14", "Az14_3", 
"Az201", "Az201_3", "Az92", "Az92_3", "Am", "Am_3", 
"Ap02", "Ap02_3", "og"), c("Az14", "Az14_3", 
"Az201", "Az201_3", "Az92", "Az92_3", "Am", "Am_3", 
"Ap02", "Ap02_3", "og"), NULL))```

Solution

  • Using the array in the updated question (here it is x).

    Get the row/column indices of each unique prefix (assumes row names = column names):

    idx <- split(1:nrow(x), sapply(strsplit(rownames(x), "_"), "[[", 1))
    

    Initialize the final distance matrix:

    m <- array(0, c(rep(length(idx), 2), dim(x)[3]), rep(list(names(idx)), 2))
    

    Calculate the mean distances for each combination of prefixes and put the values into m:

    for (i in 1:(length(idx) - 1)) {
      for (j in (i + 1):length(idx)) {
        m[i, j,] <- m[j, i,] <-
          colMeans(x[idx[[i]], idx[[j]],,drop = FALSE], dims = 2)
      }
    }
    

    Result:

    m[,,1]
    #>             Am      Ap02      Az14     Az201      Az92        og
    #> Am    0.000000 1.6666646 1.0908772 1.0908772 1.0908772 1.6666646
    #> Ap02  1.666665 0.0000000 2.0000000 2.0000000 2.0000000 0.4444245
    #> Az14  1.090877 2.0000000 0.0000000 0.1545689 0.1638032 2.0000000
    #> Az201 1.090877 2.0000000 0.1545689 0.0000000 0.1228220 2.0000000
    #> Az92  1.090877 2.0000000 0.1638032 0.1228220 0.0000000 2.0000000
    #> og    1.666665 0.4444245 2.0000000 2.0000000 2.0000000 0.0000000
    m[,,3]
    #>              Am      Ap02      Az14     Az201      Az92        og
    #> Am    0.0000000 0.4159787 0.3698928 0.2468051 0.0413979 0.6673408
    #> Ap02  0.4159787 0.0000000 1.0372336 0.9141458 0.4159787 1.1670670
    #> Az14  0.3698928 1.0372336 0.0000000 0.7397856 0.3698928 1.3346816
    #> Az201 0.2468051 0.9141458 0.7397856 0.0000000 0.2468051 1.3346816
    #> Az92  0.0413979 0.4159787 0.3698928 0.2468051 0.0000000 0.6673408
    #> og    0.6673408 1.1670670 1.3346816 1.3346816 0.6673408 0.0000000
    
    rowMeans(m, dims = 2)
    #>              Am     Ap02      Az14     Az201      Az92       og
    #> Am    0.0000000 1.360881 0.5890916 0.4969782 0.4731441 1.444668
    #> Ap02  1.3608811 0.000000 1.6790779 1.3047153 1.4719929 1.119628
    #> Az14  0.5890916 1.679078 0.0000000 0.3085786 0.2190901 1.778227
    #> Az201 0.4969782 1.304715 0.3085786 0.0000000 0.1426616 1.444894
    #> Az92  0.4731441 1.471993 0.2190901 0.1426616 0.0000000 1.555780
    #> og    1.4446685 1.119628 1.7782272 1.4448939 1.5557803 0.000000
    identical(rowMeans(m, dims = 2), apply(m, 1:2, mean))
    #> [1] TRUE
    

    As a function:

    groupDist <- function(x) {
      idx <- split(1:nrow(x), sapply(strsplit(rownames(x), "_"), "[[", 1))
      m <- array(0, c(rep(length(idx), 2), dim(x)[3]), rep(list(names(idx)), 2))
      
      for (i in 1:(length(idx) - 1)) {
        for (j in (i + 1):length(idx)) {
          m[i, j,] <- m[j, i,] <-
            colMeans(x[idx[[i]], idx[[j]],,drop = FALSE], dims = 2)
        }
      }
      
      m
    }