rdplyrtime-seriessummarizepreserve

Use dplyr to summarize but preserve date of group row


I have a data frame like the following:

          Date Flare Painmed_Use
1   2015-12-01     0           0
2   2015-12-02     0           0
3   2015-12-03     0           0
4   2015-12-04     0           0
5   2015-12-05     0           0
6   2015-12-06     0           1
7   2015-12-07     1           4
8   2015-12-08     1           3
9   2015-12-09     1           1
10  2015-12-10     1           0
11  2015-12-11     0           0
12  2015-12-12     0           0
13  2015-12-13     1           2
14  2015-12-14     1           3
15  2015-12-15     1           1
16  2015-12-16     0           0

I'm trying to find the length of each flare as well as the total med use during each flare using dplyr. My current solution (inspired by Use rle to group by runs when using dplyr),

df %>% 
    group_by(yy = {yy = rle(Flare); rep(seq_along(yy$lengths), yy$lengths)}, Flare) %>%
    summarize(Painmed_UseCum = sum(Painmed_Use),FlareLength = n())

gives the following output:

     yy Flare Painmed_UseCum FlareLength
   <int> <int>          <dbl>       <int>
 1     1     0              1           6
 2     2     1              8           4
 3     3     0              0           2
 4     4     1              6           3
 5     5     0              0           1

This is almost exactly what I need. However, I can't figure out how to preserve other columns, the critical one being the date that corresponds to the last row of a particular flare. So, the output I'm seeking is the same as above but with the addition of the Dates, like so:

           Date      yy Flare Painmed_UseCum FlareLength
                  <int> <int>          <dbl>       <int>
 1   2015-12-06       1     0              1           6
 2   2015-12-10       2     1              8           4
 3   2015-12-12       3     0              0           2
 4   2015-12-15       4     1              6           3
 5   2015-12-16       5     0              0           1

Note: In some ways this is a follow up from a previous question of mine (R code to get max count of time series data by group) but my attempt to keep that question simpler, though perhaps useful to others, ended up necessitating this further question.


Solution

  • You could either include Date in summarise

    library(dplyr)
    
    df %>% 
      group_by(yy = {yy = rle(Flare); rep(seq_along(yy$lengths),yy$lengths)}) %>%
      summarize(Painmed_UseCum = sum(Painmed_Use),FlareLength = n(), Date = max(Date))
    
    # Groups:   yy, Flare [5]
    #  Date       Flare Painmed_Use    yy
    #  <date>     <int>       <int> <int>
    #1 2015-12-06     0           1     1
    #2 2015-12-10     1           0     2
    #3 2015-12-12     0           0     3
    #4 2015-12-15     1           1     4
    #5 2015-12-16     0           0     5
    

    Or if there are more columns to preserve better approach is to use mutate and select the last row in each group.

    df %>% 
      group_by(yy = {yy = rle(Flare); rep(seq_along(yy$lengths), yy$lengths)}) %>%
      mutate(Painmed_UseCum = sum(Painmed_Use),FlareLength = n()) %>%
      slice(n())
    

    To create groups, we can replace rle with rleid from data.table which would be simpler.

    group_by(yy = data.table::rleid(Flare))