rdata.tablematch

Match data.table and matrix and compute new column


I have two data.tables "commuters" and "distance". Distance is a huge distance matrix with row and column indices. Commuters has a column "home" and "destination".

commuters (1646044 rows):

home destination
2 2
1 2
3 3
1 4
1 4

distance (1187 rows):

1 2 3 4
1 238.23 453 263 1
2 21.2 1 238.23 238.23
3 577.98 238.23 4362 2443.22
4 234.12 987.98 89.93 12.21

I would like to add a new column "distance" to df commuters. So, if the value in "home" matches with the row index of "distance" and "destination" matches with the column index of "distance", I want to add the corresponding value in the matrix to this new column "distance" in "commuters". The problem is that "home" sometimes the same value multiple times (that is why there are more rows in commuters than in distance). R keeps giving me the error that the rows in "cpmmuters" do not match the rows in "distance". However, I would like to keep duplicates (as you can see in the following table: "1 | 4" appears twice).

desired output:

home destination distance
2 2 1
1 2 453
3 3 4362
1 4 1
1 4 1

What I have tried:

commuters$distance <- distance[cbind(commuters$home, commuters$destination)]

I get this error: i is invalid type (matrix). Perhaps in future a 2 column matrix could return a list of elements of DT (in the spirit of A[B] in FAQ 2.14). Please report to data.table issue tracker if you'd like this, or add your comments to FR #657.

How can I fix this?


Solution

  • No need to do anything fancy (no melting required): the base [ accessor for matrices allows for a 2-column matrix to specify row/column indices. Your error, however, is because your "matrix" (as you call it) is really a data.table. If it were a frame or a matrix, it works:

    as.data.table(distance)[cbind(commuters$home, commuters$destination)]
    # Error in `[.data.table`(as.data.table(distance), cbind(commuters$home,  : 
    #   i is invalid type (matrix). Perhaps in future a 2 column matrix could return a list of elements of DT (in the spirit of A[B] in FAQ 2.14). Please report to data.table issue tracker if you'd like this, or add your comments to FR #657.
    as.data.frame(distance)[cbind(commuters$home, commuters$destination)]
    # [1]    1  453 4362    1    1
    as.matrix(distance)[cbind(commuters$home, commuters$destination)]
    # [1]    1  453 4362    1    1
    

    So if commuters really is a data.table, we can do

    commuters[, dist := as.matrix(distance)[cbind(home, destination)] ]
    #     home destination  dist
    #    <int>       <int> <num>
    # 1:     2           2     1
    # 2:     1           2   453
    # 3:     3           3  4362
    # 4:     1           4     1
    # 5:     1           4     1
    

    Data

    commuters <- data.table::as.data.table(structure(list(home = c(2L, 1L, 3L, 1L, 1L), destination = c(2L, 2L, 3L, 4L, 4L)), row.names = c(NA, -5L), class = c("data.table", "data.frame")))
    # the matrix version of distance ...
    # to reproduce the error, convert this to a data.frame or data.table
    distance <- structure(c(238.23, 21.2, 577.98, 234.12, 453, 1, 238.23, 987.98, 263, 238.23, 4362, 89.93, 1, 238.23, 2443.22, 12.21), dim = c(4L, 4L), dimnames = list(NULL, c("1", "2", "3", "4")))