I'm trying to find the number of consecutive runs an ID has of any score, ordered by date. I've come across a simple method using apply() that seems to work, however, that just gives me the length of each run, not the score or ID associated with it.
The dataframe consists of the following:
DATE ID SCORE
2021-04-05 12690 67
2021-04-05 12278 47
2021-04-05 12153 64
---
2021-03-26 12690 88
2021-03-26 12278 47
---
2021-03-20 12690 67
+ 120,000 more rows
DATE is in YYYY-MM-DD format
Based off the sample df above, I'd like it to return something along the lines of this:
12690 67 1
12690 88 1
12690 67 1
12278 47 2
12153 64 1
i.e. for the ID 12690, the score 67 appears once, then 88 once, then 67 once, etc
I can make it work if I force it to use only one ID at a time, but this doesn't seem terribly efficient. Whats the best way I can go about this? TIA
Here's an attempt that works on the data provided. If it doesn't work on your real data, post a bit more complex example dataset and we can try again.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
dat <- tibble::tribble(
~DATE, ~ID, ~SCORE,
"2021-04-05", 12690, 67,
"2021-04-05", 12278, 47,
"2021-04-05", 12153, 64,
"2021-03-26", 12690, 88,
"2021-03-26", 12278, 47,
"2021-03-20", 12690, 67) %>%
mutate(DATE = lubridate::ymd(DATE))
tmp <- dat %>%
arrange(ID, DATE) %>%
group_by(ID) %>%
mutate(eq = SCORE == lead(SCORE),
eq = case_when(is.na(eq) ~ FALSE,
TRUE ~ eq),
eq = case_when(!eq ~ SCORE == lag(SCORE),
TRUE ~ eq),
eq = case_when(is.na(eq) ~ FALSE,
TRUE ~ eq))
tmps <- tmp %>% split(tmp$eq)
tmps[[2]] <- tmps[[2]] %>%
group_by(ID, SCORE) %>%
summarise(n = sum(eq))
#> `summarise()` has grouped output by 'ID'. You can override using the `.groups`
#> argument.
tmps[[1]] <- tmps[[1]] %>%
select(ID, SCORE) %>%
mutate(n=1)
bind_rows(tmps)
#> # A tibble: 5 × 3
#> # Groups: ID [3]
#> ID SCORE n
#> <dbl> <dbl> <dbl>
#> 1 12153 64 1
#> 2 12690 67 1
#> 3 12690 88 1
#> 4 12690 67 1
#> 5 12278 47 2
Created on 2022-04-13 by the reprex package (v2.0.1)
Here's an alternative that may keep different runs of the same number for the same ID separate.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
dat <- tibble::tribble(
~DATE, ~ID, ~SCORE,
"2021-04-05", 12690, 67,
"2021-04-05", 12278, 47,
"2021-04-05", 12153, 64,
"2021-03-26", 12690, 88,
"2021-03-26", 12278, 47,
"2021-03-20", 12690, 67) %>%
mutate(DATE = lubridate::ymd(DATE))
tmp <- dat %>%
arrange(ID, DATE) %>%
group_by(ID) %>%
mutate(obs = 1:n()) %>%
ungroup %>%
mutate(new_grp = as.numeric(SCORE != lag(SCORE) | ID != lag(ID) | obs != lag(obs) + 1),
new_grp = ifelse(is.na(new_grp), 1, new_grp),
group = cumsum(new_grp)) %>%
group_by(ID, SCORE, group) %>%
summarise(n = n()) %>%
ungroup %>%
select(-group)
#> `summarise()` has grouped output by 'ID', 'SCORE'. You can override using the
#> `.groups` argument.
tmp
#> # A tibble: 5 × 3
#> ID SCORE n
#> <dbl> <dbl> <int>
#> 1 12153 64 1
#> 2 12278 47 2
#> 3 12690 67 1
#> 4 12690 67 1
#> 5 12690 88 1
Created on 2022-04-14 by the reprex package (v2.0.1)