I have data that looks like so:
example <- data.table(
person_id = c("A1", "A1", "A1", "A1", "A2", "A2", "A3", "A3", "B1", "B1", "C1", "C1", "C2", "C2"),
year = c(2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016),
group_id = c("abc", "abc", "cdz", "cdz", "abc", "abc", "ghe", "ghe", "abc", "fjx", "ghe", "ghe", "cdz", "cdz")
)
I need to create a column that for each person shows the other people they share a group with in a particular year. What I have works, but takes an extraordinarily long time. There are billions of rows in my dataset.
res <- example %>%
group_by(year, group_id) %>%
mutate(connections = list(person_id)) %>%
group_by(year, person_id) %>%
summarise(connections = toString(setdiff(unlist(connections), person_id)))
I was hoping to use data.table's group by as I had read it was faster, but it does not give me the same result. So my question has two parts. Why do these not accomplish the same thing? And what do I need to change in the code below so it has the same result as the code above? Or is there a more efficient way to go about this?
example <- example[, connections := list(person_id), by=list(year, group_id)]
res <- example[, .(connections = toString(setdiff(unlist(connections), person_id))), by=list(year, person_id)]
Note: my R version is 3.4.3 and I do not have tidyverse installed (can't install).
A data.table
equivalent would be
library(data.table)
setDT(example)
example[, .(person_id, connections = list(person_id)),
by = .(year, group_id)][
, .(connections = toString(setdiff(unlist(connections), person_id))),
by = .(year, person_id)]
year person_id connections
<num> <char> <char>
1: 2015 A1 A2, B1, C2
2: 2015 A2 A1, B1
3: 2015 B1 A1, A2
4: 2016 A1 A2, C2
5: 2016 A2 A1
6: 2015 C2 A1
7: 2016 C2 A1
8: 2015 A3 C1
9: 2015 C1 A3
10: 2016 A3 C1
11: 2016 C1 A3
12: 2016 B1
Note, you can use connections :=
instead of .(person_id, connections =
but that would change the data table in place. Using the normal =
leaves the original data intact.
Unit: milliseconds
expr min lq mean median uq max neval
dplyr 14.02043 14.632521 15.962934 15.359556 16.387701 24.28986 100
dt 2.93402 3.229095 3.610307 3.402923 3.808711 10.49287 100
library(dplyr)
library(microbenchmark)
getDTthreads()
[1] 1
microbenchmark(
dplyr = {example %>%
group_by(year, group_id) %>%
mutate(connections = list(person_id)) %>%
group_by(year, person_id) %>%
summarise(connections = toString(setdiff(unlist(connections), person_id)), .groups="drop")},
dt = {example[, .(person_id, connections = list(person_id)),
by = .(year, group_id)][
, .(connections = toString(setdiff(unlist(connections), person_id))),
by = .(year, person_id)]})