rdplyrgroup-bycharactersummarize

R: Group by variables, summarise a numeric column with corresponding character column


I have an example test data frame:

YEAR <- c(2019, 2019,2020, 2020, 2019,2020,2020,2019,2020)
GROUP <- c("A","A","A", "A","B","B","B","A","B")
VALUE <- c(1,4,3, 8 ,5,3,2,6,7)
STATUS <- c("on","off", "off", "on", "on", "on", "off", "off", "off")

testdata <- data.frame(YEAR, GROUP, STATUS, VALUE)

which looks something like this when arranged by YEAR: [1]: https://i.sstatic.net/aYgQh.png

I wish to group by YEAR and GROUP, summarise to get the max VALUE which I can do fine with dplyr.

testdata %>%
  group_by(YEAR, GROUP) %>%
  summarise(max_value = max(VALUE))

But I want the output to have the grouped YEAR and GROUP, max_value and the STATUS that corresponds to the max(VALUE).

I have tried a few things but can't get it to work. This comes close:

testdata %>%
  group_by(YEAR, GROUP) %>%
  summarise(max_value = max(VALUE),
            STATUS = first(STATUS))

But the output is not correct as it is giving the first STATIS (obviously) rather than the STATUS that corresponds to the row where the maximum VALUE is.

eg. the one of the output rows is 2019, A, 6, on where I want it to be 2019, A, 6, off.

This is a test dataset, my actual data is a much larger time series but I am encountering the same problem.

Hope you can help!


Solution

  • library(tidyverse)
    
    testdata |> 
      group_by(YEAR, GROUP) |> 
      slice_max(VALUE, n = 1)
    
    # Groups:   YEAR, GROUP [4]
       YEAR GROUP STATUS VALUE
      <dbl> <chr> <chr>  <dbl>
    1  2019 A     off        6
    2  2019 B     on         5
    3  2020 A     on         8
    4  2020 B     off        7
    

    Or:

    testdata |> 
      slice_max(VALUE, n = 1, by = c(GROUP, YEAR))