I would like to subset an R dataframe to display all the rows that contain specific values that could be sitting in two or more columns. Suppose I've got dataframe df
HomeTeam AwayTeam
1 Barcelona Ajax
2 Madrid Liverpool
3 Arsenal Barcelona
4 Madrid Barcelona
5 ManU Arsenal
6 Ajax Porto
And I want to display only the rows which contain "Barcelona" or "Arsenal", i.e. I want to get
HomeTeam AwayTeam
1 Barcelona Ajax
2 Arsenal Barcelona
3 Madrid Barcelona
4 ManU Arsenal
I attempted to subset the dataframe and incorporate OR condition || like this:
df = df[df$HomeTeam %in% c("Barcelona", "Arsenal")
|| df$AwayTeam %in% c("Barcelona", "Arsenal"), ]
But it gives me an error. What am I doing wrong and how to achieve the desired outcome?
As mentioned in the comments | is vectorized but || is not so your code would work if you make that change or you can try the other alternatives below, some of which are a bit shorter.
The difference is that || is not vectorized but does short circuit, i.e. the right hand side is not evaluated if the left hand side is TRUE. For example, the first below stops with an Oops error but the second returns TRUE.
TRUE | stop("Oops")
## Error: Oops
TRUE || stop("Oops")
## [1] TRUE
Also it is normally best not to clobber df
as in df <- df[..., ]
but rather create a new data frame df2 <- df[..., ]
. This will make it easier to debug since you will then know that df
is always the input and also df
and df2
will both exist at the same time so you can compare them which can be useful during debugging.
# 1
df[df$HomeTeam %in% c("Barcelona", "Arsenal") |
df$AwayTeam %in% c("Barcelona", "Arsenal"), ]
# 2
targets <- c("Barcelona", "Arsenal")
subset(df, HomeTeam %in% targets | AwayTeam %in% targets)
# 3
df[rowSums(df == "Arsenal" | df == "Barcelona") > 0, ]
# 4
subset(df, rowSums(df == "Arsenal" | df == "Barcelona") > 0)
# 5
pat <- "Arsenal|Barcelona"
subset(df, grepl(pat, HomeTeam) | grepl(pat, AwayTeam))