rif-statementmergedata.table

Merge two data.tables based on AND condition and add third column


I have two data.tables "commuters_old" and "commuter_new":

commuters_old: (1.703.000 observations)

from1 to1 amount1
1001 1001 12
1001 1002 2
1001 1003 23
2000 1002 200
2001 2001 80
2001 2002 98

commuters_new: (1.400.000 observations)

from2 to2 amount2
1000 1001 10
1001 1002 22
1001 1004 2
2000 1002 50
2001 2001 8
2001 2003 100
library(data.table)

commuters_old <- data.table(
  from1 = c(1001L, 1001L, 1001L, 2000L, 2001L, 2001L),
  to1 = c(1001L, 1002L, 1003L, 1002L, 2001L, 2002L),
  amount1 = c(12L, 2L, 23L, 200L, 80L, 98L))

commuters_new <- data.table(
  from2 = c(1000L, 1001L, 1001L, 2000L, 2001L, 2001L),
  to2 = c(1001L, 1002L, 1004L, 1002L, 2001L, 2003L),
  amount2 = c(10L, 22L, 2L, 50L, 8L, 100L))

I would like to add the column "amount1" to data.table "commuters_new" but only if "from1 == from2" AND "to1 == to2". If this condition is not met, I want to write 0, so that:

commuters_new:

from2 to2 amount2 amount1
1000 1001 10 0
1001 1002 22 2
1001 1004 2 0
2000 1002 50 200
2001 2001 8 80
2001 2003 100 0

After that merge, I want to add another column "factor" that gives me the factor of "amount2" and"amount1".

commuters_new:

from2 to2 amount2 amount1 factor
1000 1001 10 0 0
1001 1002 22 2 11
1001 1004 2 0 0
2000 1002 50 200 0.25
2001 2001 8 80 0.1
2001 2003 100 0 0

What I have tried:

commuters_new[, amount1 := ifelse(from1==from2 & to1==to2, commuters_old$amount1, 0)]

OR

commuters_new$amount1[, paste(ifelse(from1==from2 & to1==to2, commuters_old$amount1, 0))]

But I always get the error:

Supplied 1703144 items to be assigned to 1408969 items of column 'amount1'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code


Solution

  • Another way, using dtplyr:

    Steps:

    1. Load it as a lazy data table
    2. Perform a left join, joining new on old
    3. Replace the empty values in our new amount1 column with 0s
    4. If amount1 is 0, factor is 0. Otherwise, it is amount2 / amount1
    5. Tell dtplyr to run the code by running as.data.table()
    pacman::p_load(data.table, dtplyr, tidyverse)
    
    commuters_new |>
      lazy_dt() |>
      left_join(commuters_old, by = c("from2" = "from1", "to2" = "to1")) |>
      replace_na(list(amount1 = 0L))|>
      mutate(factor = ifelse(!amount1, 0, amount2 / amount1)) |>
      as.data.table()
    

    Equivalent data.table code:

    setnames(setcolorder(`_DT34`[`_DT33`, on = .(from1 = from2, to1 = to2), 
        allow.cartesian = TRUE], c(1L, 2L, 4L, 3L)), c("from1", "to1"
    ), c("from2", "to2"))[, `:=`(amount1 = fcoalesce(amount1, 0L))][, 
        `:=`(factor = fifelse(amount1 == 0, 0, amount2/amount1))]
    # or
    out <- (commuters_old[commuters_new, on = .(from1 = from2, to1 = to2)]
           [, amount1 := fcoalesce(amount1, 0L)]
           [, factor := ifelse(!amount1, 0, amount2 / amount1)])
    

    Output:

       from1  to1 amount1 amount2 factor
    1:  1000 1001       0      10   0.00
    2:  1001 1002       2      22  11.00
    3:  1001 1004       0       2   0.00
    4:  2000 1002     200      50   0.25
    5:  2001 2001      80       8   0.10
    6:  2001 2003       0     100   0.00