From the original sample dataset:
id <- c('1','1','2', '2', '3', '3', '3')
month <- c('6', '6', '3', '3', '4', '4', '4')
iso <- c('MEX', 'USA', 'CRI', 'SPA', 'CHN', 'MEX', 'SPA')
value <- c('1550', '1550', '384', '115', '1100', '1100', '1100')
original <- data.frame(id, month, iso, value)
I want to end up with only 1 observation for each id-month pair. The rule to follow is:
Therefore, the new sample dataset will look like this:
id <- c('1', '2', '3')
month <- c('6', '3', '4')
iso <- c('USA', 'CRI','MEX')
value <- c('1550', '384', '1100')
selection_criteria <- c('random','max_value','random')
new <- data.frame(id, month, iso, value, selection_criteria)
I have tried to run the following code:
new <- original %>% group_by(id, month) %>%
filter(value == max(value))
However, it does not make the trick of selecting one variable at random when I have more than one observation (for the same id-month pair) with a maximum value.
My intention is to automatise the process given the large dimension of my dataset.
Any clue?
Thank you.
Old Answer
set.seed(2021)
new <- original %>% group_by(id, month) %>%
slice_max(as.numeric(value)) %>% sample_n(1)
> new
# A tibble: 3 x 4
# Groups: id, month [3]
id month iso value
<chr> <chr> <chr> <chr>
1 1 6 MEX 1550
2 2 3 CRI 384
3 3 4 MEX 1100
slice_max will cause filtering all max of values rows in each group. Further sample_n(size =1) will again restrict one row in each group
However, the approach suggested by other user is better as it reduces the number of steps-
set.seed(2021)
original %>%
group_by(id, month) %>%
slice_max(order_by = as.numeric(value), n = 1, with_ties = FALSE)