rdatatableaverage

How to find the average of n largest values by group and add as a new column in r data table


I read a lot similar questions before asking a new one but here I am. I have a long data table that consist of plot, dbh, etc. An example of my data like this:

structure(list(plot = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), dbh = c(18L, 14L, 
13L, 20L, 20L, 15L, 9L, 12L, 22L, 21L, 14L, 14L, 13L, 18L, 24L, 
19L, 13L, 15L, 17L, 22L, 11L)), class = "data.frame", row.names = c(NA, 
-21L))

What I want to do is find the average of 5 largest values by group (plot) and add this values as a new column to the same data table. I'm expecting to get the following result.

structure(list(plot = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), dbh = c(18L, 14L, 
13L, 20L, 20L, 15L, 9L, 12L, 22L, 21L, 14L, 14L, 13L, 18L, 24L, 
19L, 13L, 15L, 17L, 22L, 11L), dom = c(17.4, 17.4, 17.4, 17.4, 
17.4, 17.4, 17.4, 17.4, 21.6, 21.6, 21.6, 21.6, 21.6, 21.6, 21.6, 
21.6, 21.6, 21.6, 21.6, 21.6, 21.6)), class = "data.frame", row.names = c(NA, 
-21L))

I will be appreciate for your help. Thanks.

PS: I tried many different codes within different packages (data.table, dplyr, etc..) however couldn't able to make it so I won't give any mwe that doesn't work.


Solution

  • You can use head/tail to get top 5 values :

    df$dom <- with(df, ave(dbh, plot, FUN = function(x) mean(tail(sort(x), 5))))
    #same as doing 1:5
    #df$dom <- with(df, ave(dbh, plot, FUN = function(x) 
                        mean(sort(x, decreasing = TRUE)[1:5])))
    

    Or using dplyr :

    library(dplyr)
    df %>% group_by(plot) %>% mutate(dom = mean(tail(sort(dbh), 5)))
    

    and data.table :

    library(data.table)
    setDT(df)[, dom := mean(tail(sort(dbh), 5)), plot]
    df
    
    #    plot dbh  dom
    # 1:    1  18 17.4
    # 2:    1  14 17.4
    # 3:    1  13 17.4
    # 4:    1  20 17.4
    # 5:    1  20 17.4
    # 6:    1  15 17.4
    # 7:    1   9 17.4
    # 8:    1  12 17.4
    # 9:    2  22 21.6
    #10:    2  21 21.6
    #11:    2  14 21.6
    #12:    2  14 21.6
    #13:    2  13 21.6
    #14:    2  18 21.6
    #15:    2  24 21.6
    #16:    2  19 21.6
    #17:    2  13 21.6
    #18:    2  15 21.6
    #19:    2  17 21.6
    #20:    2  22 21.6
    #21:    2  11 21.6
    #    plot dbh  dom
    

    dplyr also has slice_max function (previously top_n) to get top n values in each group.

    df %>%
      group_by(plot) %>%
      slice_max(dbh, n = 5) %>%
      summarise(dom = mean(dbh)) %>%
      left_join(df, by = 'plot')