rtidyverse

Working with Tidyverse and the base R function ´Which´


I have been having so much fun working with the Base R function "Which" to deal with sequences but on a larger scale I will soon be using Tidyverse to do some data wrangling and I can't seem to find a way to incorporate this functions with Tidyverse.

Let me explain what I want to do. I have the following dataset:

STUDENT EVENT
James GEO_EXAM
James PIZZA_PARTY
Kylie MATH_EXAM
Kylie PIZZA_PARTY
Laura MATH_EXAM
Mark MATH_EXAM
Mark PIZZA_PARTY

I want to be able to sort the events (for simplicity these events are in the desired order) and then group by STUDENT and find what was the last event or last events or each group of students prior to a PIZZA_PARTY (we will assume the order goes from top to bottom) the dataset I have is oversimplified to illustrate this case.

That been said, for the previous data set we'll have that for James the event prior to a PIZZA_PARTY was a GEO exam, for Kylie a MATH exam, for Laura there was not event prior to a PIZZA_PARTY and finally for Mark it was a MATH exam. This would be the desired outcome.

But if I omit the fact that I indeed want to resolve this exercise grouping by STUDENTS I could easy use the following code to get started

df$EVENT[which(df$EVENT=="PIZZA_PARTY")-1]

but if I want to use Tidyverse to solve this issue in the way that I want I would use:

df %>% 
  group_by(STUDENT) %>% 
  mutate(INDEX=(which(EVENT=="PIZZA_PARTY")-1))

but I get an error, I also get an error when I use summarise. How can I learn how to properly use base R functions with Tidyverse?


Solution

  • Issue is when we have a group with missing value i.e. for 'Laura', there is no PIZZA_PARTY

    > subset(df, STUDENT == "Laura" & EVENT == "PIZZA_PARTY")
    [1] STUDENT EVENT  
    <0 rows> (or 0-length row.names)
    

    and thus returns logical(0). In mutate, it requires the same length of output as the number of rows (or if it is grouped, the length should match the same number of elements of group). Note that even duplicate values can result in the same error. So, it is better to get the index of the first available value

    library(dplyr)
    df %>% 
      group_by(STUDENT) %>% 
      mutate(INDEX=which(EVENT == "PIZZA_PARTY")[1]-1)
    

    -output

    # A tibble: 7 × 3
    # Groups:   STUDENT [4]
      STUDENT EVENT       INDEX
      <chr>   <chr>       <dbl>
    1 James   GEO_EXAM        1
    2 James   PIZZA_PARTY     1
    3 Kylie   MATH_EXAM       1
    4 Kylie   PIZZA_PARTY     1
    5 Laura   MATH_EXAM      NA
    6 Mark    MATH_EXAM       1
    7 Mark    PIZZA_PARTY     1
    

    Or another option is to also use match

    df %>%
        group_by(STUDENT) %>%
        mutate(INDEX = match("PIZZA_PARTY", EVENT) - 1)
    

    If we need this to create a RESPONSE variable with the corresponding 'EVENT'

    df %>%  
          group_by(STUDENT) %>% 
          mutate(RESPONSE = EVENT[match("PIZZA_PARTY", EVENT) - 1])
    

    -output

    # A tibble: 7 × 3
    # Groups:   STUDENT [4]
      STUDENT EVENT       EVENTNEW 
      <chr>   <chr>       <chr>    
    1 James   GEO_EXAM    GEO_EXAM 
    2 James   PIZZA_PARTY GEO_EXAM 
    3 Kylie   MATH_EXAM   MATH_EXAM
    4 Kylie   PIZZA_PARTY MATH_EXAM
    5 Laura   MATH_EXAM   <NA>     
    6 Mark    MATH_EXAM   MATH_EXAM
    7 Mark    PIZZA_PARTY MATH_EXAM
    

    data

    df <- structure(list(STUDENT = c("James", "James", "Kylie", "Kylie", 
    "Laura", "Mark", "Mark"), EVENT = c("GEO_EXAM", "PIZZA_PARTY", 
    "MATH_EXAM", "PIZZA_PARTY", "MATH_EXAM", "MATH_EXAM", "PIZZA_PARTY"
    )), class = "data.frame", row.names = c(NA, -7L))