[SOLVED] find duplicates with grouped variables

find duplicates with grouped variables

I have a df that looks like this:

I guess it will work some with dplyr and duplicates. Yet I don't know how to address multiple columns while distinguishing between a grouped variable.

from  to  group

1     2   metro
2     4   metro
3     4   metro
4     5   train
6     1   train
8     7   train

I want to find the ids which exist in more than one group variable.

The expected result for the sample df is: 1 and 4. Because they exist in the metro and the train group.

Thank you in advance!

Solution

We gather the 'from', 'to' columns to 'long' format, grouped by 'val', filter the groups having more than one unique elements, then pull the unique 'val' elements

library(dplyr)
library(tidyr)
df1 %>% 
   gather(key, val, from:to) %>% 
   group_by(val) %>% 
   filter(n_distinct(group) > 1) %>%
   distinct(val) %>%
   pull(val)
#[1] 1 4

Or using base R we can just table to find the frequency, and get the ids out of it

out <-  with(df1, colSums(table(rep(group, 2), unlist(df1[1:2])) > 0)) > 1
names(which(out))
#[1] "1" "4"

data

df1 <- structure(list(from = c(1L, 2L, 3L, 4L, 6L, 8L), to = c(2L, 4L, 
 4L, 5L, 1L, 7L), group = c("metro", "metro", "metro", "train", 
 "train", "train")), class = "data.frame", row.names = c(NA, -6L
 ))