rduplicatesidentify

find duplicates with grouped variables


I have a df that looks like this:

I guess it will work some with dplyr and duplicates. Yet I don't know how to address multiple columns while distinguishing between a grouped variable.

from  to  group

1     2   metro
2     4   metro
3     4   metro
4     5   train
6     1   train
8     7   train

I want to find the ids which exist in more than one group variable.

The expected result for the sample df is: 1 and 4. Because they exist in the metro and the train group.

Thank you in advance!


Solution

  • We gather the 'from', 'to' columns to 'long' format, grouped by 'val', filter the groups having more than one unique elements, then pull the unique 'val' elements

    library(dplyr)
    library(tidyr)
    df1 %>% 
       gather(key, val, from:to) %>% 
       group_by(val) %>% 
       filter(n_distinct(group) > 1) %>%
       distinct(val) %>%
       pull(val)
    #[1] 1 4
    

    Or using base R we can just table to find the frequency, and get the ids out of it

    out <-  with(df1, colSums(table(rep(group, 2), unlist(df1[1:2])) > 0)) > 1
    names(which(out))
    #[1] "1" "4"
    

    data

    df1 <- structure(list(from = c(1L, 2L, 3L, 4L, 6L, 8L), to = c(2L, 4L, 
     4L, 5L, 1L, 7L), group = c("metro", "metro", "metro", "train", 
     "train", "train")), class = "data.frame", row.names = c(NA, -6L
     ))