rdataframedplyrsubset

How to subset R dataframe based on specific values in several columns?


I would like to subset an R dataframe to display all the rows that contain specific values that could be sitting in two or more columns. Suppose I've got dataframe df

   HomeTeam    AwayTeam
1 Barcelona      Ajax
2   Madrid    Liverpool
3   Arsenal   Barcelona
4   Madrid    Barcelona
5    ManU       Arsenal
6    Ajax       Porto

And I want to display only the rows which contain "Barcelona" or "Arsenal", i.e. I want to get

   HomeTeam     AwayTeam
1  Barcelona      Ajax
2   Arsenal     Barcelona
3   Madrid      Barcelona
4    ManU        Arsenal

I attempted to subset the dataframe and incorporate OR condition || like this:

df = df[df$HomeTeam %in% c("Barcelona", "Arsenal") 
                   || df$AwayTeam %in% c("Barcelona", "Arsenal"), ]

But it gives me an error. What am I doing wrong and how to achieve the desired outcome?


Solution

  • As mentioned in the comments | is vectorized but || is not so your code would work if you make that change or you can try the other alternatives below, some of which are a bit shorter.

    The difference is that || is not vectorized but does short circuit, i.e. the right hand side is not evaluated if the left hand side is TRUE. For example, the first below stops with an Oops error but the second returns TRUE.

    TRUE | stop("Oops")
    ## Error: Oops
    
    TRUE || stop("Oops")
    ## [1] TRUE
    

    Also it is normally best not to clobber df as in df <- df[..., ] but rather create a new data frame df2 <- df[..., ]. This will make it easier to debug since you will then know that df is always the input and also df and df2 will both exist at the same time so you can compare them which can be useful during debugging.

    # 1
    df[df$HomeTeam %in% c("Barcelona", "Arsenal") | 
       df$AwayTeam %in% c("Barcelona", "Arsenal"), ]
    
    # 2
    targets <- c("Barcelona", "Arsenal")
    subset(df, HomeTeam %in% targets | AwayTeam %in% targets)
    
    # 3
    df[rowSums(df == "Arsenal" | df == "Barcelona") > 0, ]
    
    # 4
    subset(df, rowSums(df == "Arsenal" | df == "Barcelona") > 0)
    
    # 5
    pat <- "Arsenal|Barcelona"
    subset(df, grepl(pat, HomeTeam) | grepl(pat, AwayTeam))