rmultiple-columnscontainskey

Check if pair of columns is in a row of a data frame


I'd like to know if there is any efficient way of checking if a given pair (or tuple of more than two) of columns is in a data frame.

For example, suppose I had the following data frame:

df=data.frame(c("a","b","c","d"),c("e","f","g","h"),c(1,0,0,1))
names(df)=c('col1','col2','col3')

  col1 col2 col3
1    a    e    1
2    b    f    0
3    c    g    0
4    d    h    1

and I want to check if this table contains a list of pairs of columns, say: (a,b), (a,c), (a,e), (c,a), (c,g), (a,f)

to which it should output:

FALSE FALSE TRUE FALSE TRUE FALSE

Edit: added a new pair (a,f) to avoid confusion

I thought of doing this by concatenating the columns into strings and then comparing with %in%, but this is quite inefficient. I also thought of doing a loop with dplyr's filter, but it also takes quite a time when the table is huge and requires converting formats (i.e. writing several lines).

Is there any efficient way of accomplishing this in R?


Solution

  • This seems like a case for one of the apply or lapply family of functions. If you define pairs.list as a list, you can use lapply:

    df = data.frame(c("a","b","c","d"), c("e","f","g","h"), c(1,0,0,1))
    names(df) = c('col1','col2','col3')
    pairs.list = list(c("a", "b"), c("a", "c"), c("a", "e"), c("c", "a"), c("c", "g"))
    lapply(pairs.list, FUN=function(x){any(df$col1==x[[1]] & df$col2==x[[2]])})
    
    [[1]]
    [1] FALSE
    
    [[2]]
    [1] FALSE
    
    [[3]]
    [1] TRUE
    
    [[4]]
    [1] FALSE
    
    [[5]]
    [1] TRUE
    
    new.pairs = list(c("a", "b"), c("a", "c"), c("e", "a"), c("c", "a"), c("c", "g"))
    
    lapply(new.pairs, FUN=function(x){any(df$col1==x[[1]] & df$col2==x[[2]])})
    
    [[1]]
    [1] FALSE
    
    [[2]]
    [1] FALSE
    
    [[3]]
    [1] FALSE
    
    [[4]]
    [1] FALSE
    
    [[5]]
    [1] TRUE
    

    With this method, if you want to know the row of the df that matches, you can get rid of the any() call and receive a list of a vector of booleans, where each vector is the same length as df.

    I think this should be relatively efficient because it is all boolean logic rather than string manipulation, but I'm not an expert at benchmarking performance in R so I don't know for sure.