I'd like to know if there is an efficient way of checking whether a given pair of values (or a tuple of more than two) appears as a row in a data frame.
For example, suppose I had the following data frame:
df=data.frame(c("a","b","c","d"),c("e","f","g","h"),c(1,0,0,1))
names(df)=c('col1','col2','col3')
col1 col2 col3
1 a e 1
2 b f 0
3 c g 0
4 d h 1
and I want to check whether this table contains each of a list of (col1, col2) value pairs, say: (a,b), (a,c), (a,e), (c,a), (c,g), (a,f)
to which it should output:
FALSE FALSE TRUE FALSE TRUE FALSE
Edit: added a new pair (a,f) to avoid confusion
I thought of doing this by concatenating the columns into strings and then comparing with %in%, but this seems quite inefficient. I also thought of looping with dplyr's filter, but that also takes quite a while when the table is huge and requires converting formats (i.e. writing several lines).
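Roughly what I mean by the concatenation idea (a sketch; "|" is just an arbitrary separator I picked for the pasted keys):

# Paste col1/col2 into one key per row, then test the query pairs with %in%
keys  <- paste(df$col1, df$col2, sep = "|")
pairs <- c("a|b", "a|c", "a|e", "c|a", "c|g", "a|f")
pairs %in% keys
# [1] FALSE FALSE  TRUE FALSE  TRUE FALSE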
Is there any efficient way of accomplishing this in R?
This seems like a case for one of the apply or lapply family of functions. If you define pairs.list as a list, you can use lapply:
df = data.frame(c("a","b","c","d"), c("e","f","g","h"), c(1,0,0,1))
names(df) = c('col1','col2','col3')
pairs.list = list(c("a", "b"), c("a", "c"), c("a", "e"), c("c", "a"), c("c", "g"))
lapply(pairs.list, FUN=function(x){any(df$col1==x[[1]] & df$col2==x[[2]])})
[[1]]
[1] FALSE
[[2]]
[1] FALSE
[[3]]
[1] TRUE
[[4]]
[1] FALSE
[[5]]
[1] TRUE
Note that order within each pair matters, since the first element is compared against col1 and the second against col2. If the third pair is reversed to (e, a), it no longer matches:
new.pairs = list(c("a", "b"), c("a", "c"), c("e", "a"), c("c", "a"), c("c", "g"))
lapply(new.pairs, FUN=function(x){any(df$col1==x[[1]] & df$col2==x[[2]])})
[[1]]
[1] FALSE
[[2]]
[1] FALSE
[[3]]
[1] FALSE
[[4]]
[1] FALSE
[[5]]
[1] TRUE
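If you would rather get a plain logical vector like the one in the question instead of a list, you could swap in sapply() or vapply() — a minimal sketch, using the same pairs.list as above:

# Same check, returned as a plain logical vector rather than a list
vapply(pairs.list, FUN = function(x) any(df$col1 == x[1] & df$col2 == x[2]),
       FUN.VALUE = logical(1))
# [1] FALSE FALSE  TRUE FALSE  TRUE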
With this method, if you want to know which row of df matches, you can get rid of the any() call and receive a list of logical vectors, where each vector has one element per row of df.
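For instance, a sketch of pulling the matching row indices directly, using which() on that same logical vector (again assuming the df and pairs.list defined above):

# Row numbers (possibly empty) where each pair matches
lapply(pairs.list, FUN = function(x) which(df$col1 == x[1] & df$col2 == x[2]))
# e.g. element 3 is 1 (pair (a, e) matches row 1) and element 5 is 3 (pair (c, g) matches row 3);
# the others are integer(0)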
I think this should be relatively efficient because it uses vectorized comparisons rather than string manipulation, but I'm not an expert at benchmarking performance in R, so I don't know for sure.
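If you did want to check, a rough base-R timing on a larger, made-up data frame could look something like this (the one-million-row size here is arbitrary):

# Rough timing sketch on made-up data
set.seed(1)
big <- data.frame(col1 = sample(letters, 1e6, replace = TRUE),
                  col2 = sample(letters, 1e6, replace = TRUE),
                  stringsAsFactors = FALSE)
system.time(
  lapply(pairs.list, FUN = function(x) any(big$col1 == x[1] & big$col2 == x[2]))
)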