rdataframedplyrfilterdata.table

Filter grouped data with filter function and if else condition in R


Consider the data frame

id <-c(1,1,1,2,2,3,3,3,3)
x1 <-c("no","yes","yes","no","no","no","no","no","yes")
x2 <-c(9,3,7,4,5,2,5,6,11)
df <- data.frame(id, x1, x2)

For each id I wish to keep the first row where x1="yes" and where x1 is completely "no" for a particular id I want to retain the last row for that id. I wrote the code

df %>% group_by(id) %>% 
       filter(if(x1 == "yes") x1 == "yes" else row_number() %in% c(n()))

but it runs into an error. The expect is

id  x1 x2
 1 yes  3
 2  no  5
 3 yes 11

I was wondering if this code could be improved to get this issue resolved? I also appreciate alternative approaches (e.g., data.table approach) of handling this. Thank you!


Solution

  • your code fails because filter produces a vector for each group instead of a single true/false.
    so x1 == "yes" produces a vector like c(FALSE, TRUE, TRUE) while your if statement expects a singular TRUE or FALSE.

    I recommend using slice for this use case as such:

    df %>%
      group_by(id) %>%
      slice(if (any(x1 == "yes")) which.max(x1 == "yes") else n())
    

    any(x1 == "yes") makes it so that it checks the entire group and returns a single TRUE or FALSE
    and using slice() picks rows by their position (eg: 1st row, 5th row)

    so which.max(x1 == "yes") finds the position of the first "yes" and selects it.
    and if none exist, it gives the position of the last "no"