rdata.table

When are data.tables's := and dpylr's mutate different?


I have data that looks like so:

example <- data.table(
person_id = c("A1", "A1", "A1", "A1", "A2", "A2", "A3", "A3", "B1", "B1", "C1", "C1", "C2", "C2"),
year = c(2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016),
group_id = c("abc", "abc", "cdz", "cdz", "abc", "abc", "ghe", "ghe", "abc", "fjx", "ghe", "ghe", "cdz", "cdz")
)

I need to create a column that for each person shows the other people they share a group with in a particular year. What I have works, but takes an extraordinarily long time. There are billions of rows in my dataset.

res <- example %>%
  group_by(year, group_id) %>%
  mutate(connections = list(person_id)) %>%
  group_by(year, person_id) %>%
  summarise(connections = toString(setdiff(unlist(connections), person_id)))

I was hoping to use data.table's group by as I had read it was faster, but it does not give me the same result. So my question has two parts. Why do these not accomplish the same thing? And what do I need to change in the code below so it has the same result as the code above? Or is there a more efficient way to go about this?

example <- example[, connections := list(person_id), by=list(year, group_id)]
res <- example[, .(connections = toString(setdiff(unlist(connections), person_id))), by=list(year, person_id)]

Note: my R version is 3.4.3 and I do not have tidyverse installed (can't install).


Solution

  • A data.table equivalent would be

    library(data.table)
    
    setDT(example)
    
    example[, .(person_id, connections = list(person_id)), 
        by = .(year, group_id)][
      , .(connections = toString(setdiff(unlist(connections), person_id))), 
        by = .(year, person_id)]
         year person_id connections
        <num>    <char>      <char>
     1:  2015        A1  A2, B1, C2
     2:  2015        A2      A1, B1
     3:  2015        B1      A1, A2
     4:  2016        A1      A2, C2
     5:  2016        A2          A1
     6:  2015        C2          A1
     7:  2016        C2          A1
     8:  2015        A3          C1
     9:  2015        C1          A3
    10:  2016        A3          C1
    11:  2016        C1          A3
    12:  2016        B1
    

    Note, you can use connections := instead of .(person_id, connections = but that would change the data table in place. Using the normal = leaves the original data intact.

    Performance comparison
    Unit: milliseconds
      expr      min        lq      mean    median        uq      max neval
     dplyr 14.02043 14.632521 15.962934 15.359556 16.387701 24.28986   100
        dt  2.93402  3.229095  3.610307  3.402923  3.808711 10.49287   100
    
    library(dplyr)
    library(microbenchmark)
    
    getDTthreads()
    [1] 1
    
    microbenchmark(
      dplyr = {example %>%
        group_by(year, group_id) %>%
        mutate(connections = list(person_id)) %>%
        group_by(year, person_id) %>%
        summarise(connections = toString(setdiff(unlist(connections), person_id)), .groups="drop")},
      dt = {example[, .(person_id, connections = list(person_id)),
          by = .(year, group_id)][
        , .(connections = toString(setdiff(unlist(connections), person_id))),
          by = .(year, person_id)]})