rfilterdplyr

Select rows at random in R under certain conditions


From the original sample dataset:

id <- c('1','1','2', '2', '3', '3', '3')
month <- c('6', '6', '3', '3', '4', '4', '4')
iso <- c('MEX', 'USA', 'CRI', 'SPA', 'CHN', 'MEX', 'SPA')
value <- c('1550', '1550', '384', '115', '1100', '1100', '1100')
original <- data.frame(id, month, iso, value)

I want to end up with only 1 observation for each id-month pair. The rule to follow is:

  1. Choose the maximum value for each id-month pair.
  2. In case there is the same maximum value for the different observations of the same id-month pair, I want to choose one of the rows at random.

Therefore, the new sample dataset will look like this:

id <- c('1', '2', '3')
month <- c('6', '3', '4')
iso <- c('USA', 'CRI','MEX')
value <- c('1550', '384', '1100')
selection_criteria <- c('random','max_value','random')
new <- data.frame(id, month, iso, value, selection_criteria)

I have tried to run the following code:

new <- original %>% group_by(id, month) %>%
  filter(value == max(value))

However, it does not make the trick of selecting one variable at random when I have more than one observation (for the same id-month pair) with a maximum value.

My intention is to automatise the process given the large dimension of my dataset.

Any clue?

Thank you.


Solution

  • Old Answer

    set.seed(2021)
    
    new <- original %>% group_by(id, month) %>%
      slice_max(as.numeric(value)) %>% sample_n(1)
    
    > new
    # A tibble: 3 x 4
    # Groups:   id, month [3]
      id    month iso   value
      <chr> <chr> <chr> <chr>
    1 1     6     MEX   1550 
    2 2     3     CRI   384  
    3 3     4     MEX   1100
    

    slice_max will cause filtering all max of values rows in each group. Further sample_n(size =1) will again restrict one row in each group


    However, the approach suggested by other user is better as it reduces the number of steps-

    set.seed(2021)
    
    original %>% 
      group_by(id, month) %>% 
      slice_max(order_by = as.numeric(value), n = 1, with_ties = FALSE)