I have been having so much fun working with the Base R function "Which" to deal with sequences but on a larger scale I will soon be using Tidyverse to do some data wrangling and I can't seem to find a way to incorporate this functions with Tidyverse.
Let me explain what I want to do. I have the following dataset:
STUDENT | EVENT |
---|---|
James | GEO_EXAM |
James | PIZZA_PARTY |
Kylie | MATH_EXAM |
Kylie | PIZZA_PARTY |
Laura | MATH_EXAM |
Mark | MATH_EXAM |
Mark | PIZZA_PARTY |
I want to be able to sort the events (for simplicity these events are in the desired order) and then group by STUDENT and find what was the last event or last events or each group of students prior to a PIZZA_PARTY (we will assume the order goes from top to bottom) the dataset I have is oversimplified to illustrate this case.
That been said, for the previous data set we'll have that for James the event prior to a PIZZA_PARTY was a GEO exam, for Kylie a MATH exam, for Laura there was not event prior to a PIZZA_PARTY and finally for Mark it was a MATH exam. This would be the desired outcome.
But if I omit the fact that I indeed want to resolve this exercise grouping by STUDENTS I could easy use the following code to get started
df$EVENT[which(df$EVENT=="PIZZA_PARTY")-1]
but if I want to use Tidyverse to solve this issue in the way that I want I would use:
df %>%
group_by(STUDENT) %>%
mutate(INDEX=(which(EVENT=="PIZZA_PARTY")-1))
but I get an error, I also get an error when I use summarise. How can I learn how to properly use base R functions with Tidyverse?
Issue is when we have a group with missing value i.e. for 'Laura', there is no PIZZA_PARTY
> subset(df, STUDENT == "Laura" & EVENT == "PIZZA_PARTY")
[1] STUDENT EVENT
<0 rows> (or 0-length row.names)
and thus returns logical(0)
. In mutate
, it requires the same length of output as the number of rows (or if it is grouped, the length should match the same number of elements of group). Note that even duplicate values can result in the same error. So, it is better to get the index of the first available value
library(dplyr)
df %>%
group_by(STUDENT) %>%
mutate(INDEX=which(EVENT == "PIZZA_PARTY")[1]-1)
-output
# A tibble: 7 × 3
# Groups: STUDENT [4]
STUDENT EVENT INDEX
<chr> <chr> <dbl>
1 James GEO_EXAM 1
2 James PIZZA_PARTY 1
3 Kylie MATH_EXAM 1
4 Kylie PIZZA_PARTY 1
5 Laura MATH_EXAM NA
6 Mark MATH_EXAM 1
7 Mark PIZZA_PARTY 1
Or another option is to also use match
df %>%
group_by(STUDENT) %>%
mutate(INDEX = match("PIZZA_PARTY", EVENT) - 1)
If we need this to create a RESPONSE variable with the corresponding 'EVENT'
df %>%
group_by(STUDENT) %>%
mutate(RESPONSE = EVENT[match("PIZZA_PARTY", EVENT) - 1])
-output
# A tibble: 7 × 3
# Groups: STUDENT [4]
STUDENT EVENT EVENTNEW
<chr> <chr> <chr>
1 James GEO_EXAM GEO_EXAM
2 James PIZZA_PARTY GEO_EXAM
3 Kylie MATH_EXAM MATH_EXAM
4 Kylie PIZZA_PARTY MATH_EXAM
5 Laura MATH_EXAM <NA>
6 Mark MATH_EXAM MATH_EXAM
7 Mark PIZZA_PARTY MATH_EXAM
df <- structure(list(STUDENT = c("James", "James", "Kylie", "Kylie",
"Laura", "Mark", "Mark"), EVENT = c("GEO_EXAM", "PIZZA_PARTY",
"MATH_EXAM", "PIZZA_PARTY", "MATH_EXAM", "MATH_EXAM", "PIZZA_PARTY"
)), class = "data.frame", row.names = c(NA, -7L))