r, filter, dplyr, grouping, top-n

dplyr - How to filter the top n groups with the highest total (sum) sales?


I am using dplyr in R and I am trying to filter a tibble which contains transactional data.

The columns of interest are "Country" and "Sales".

I have a lot of countries, and for exploration purposes I want to analyze only the top 5 countries by total sales.

The trouble is that I cannot simply summarise by group, because I need to keep all the individual rows for further analysis (it is transactional data).

I tried something like:

trans_merch_df %>% group_by(COUNTRY) %>% top_n(n = 5, wt = NET_SLS_AMT)

But it's completely off.

Let's say I have this:

trans_merch_df <- tibble::tribble(~COUNTRY, ~SALE,
                                  'POR',     14,
                                  'POR',     1,
                                  'DEU',     4,
                                  'DEU',     6,
                                  'POL',     8,
                                  'ITA',     1,
                                  'ITA',     1,
                                  'ITA',     1,
                                  'SPA',     1,
                                  'NOR',     50,
                                  'NOR',     10,
                                  'SWE',     42,
                                  'SWE',     1)

The result I am expecting is:

COUNTRY   SALE
POR       14
POR       1
DEU       4
DEU       6
POL       8
NOR       50
NOR       10
SWE       42
SWE       1

ITA and SPA are dropped because they are not among the top 5 countries by total sales.

Thanks a lot in advance.

Cheers!


Solution

  • One dplyr possibility could be:

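    trans_merch_df %>%
     add_count(COUNTRY, wt = SALE) %>%    # n = total SALE per COUNTRY, repeated on every row
     mutate(n = dense_rank(desc(n))) %>%  # rank countries by that total, highest first
     filter(n %in% 1:5) %>%               # keep the rows of the five top-ranked countries
     select(-n)                           # drop the helper column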
    
    
      COUNTRY  SALE
      <chr>   <dbl>
    1 POR        14
    2 POR         1
    3 DEU         4
    4 DEU         6
    5 POL         8
    6 NOR        50
    7 NOR        10
    8 SWE        42
    9 SWE         1
    

    Or even more concise:

    trans_merch_df %>%
     add_count(COUNTRY, wt = SALE) %>%
     filter(dense_rank(desc(n)) %in% 1:5) %>%
     select(-n)
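
As a possible alternative, the same filter can be sketched in two steps: first work out which countries have the five largest totals, then keep only the transactions from those countries. This is a minimal sketch and not part of the original answer. It assumes dplyr 1.0.0 or later (for slice_max()), and the names total, top_countries and top_sales are made up for the example. Ties are handled slightly differently from the dense_rank() version: slice_max() keeps ties only at the cut-off by default, whereas dense_rank() keeps every country sharing one of the five highest totals.

```r
library(dplyr)

trans_merch_df <- tibble::tribble(
  ~COUNTRY, ~SALE,
  "POR", 14,
  "POR", 1,
  "DEU", 4,
  "DEU", 6,
  "POL", 8,
  "ITA", 1,
  "ITA", 1,
  "ITA", 1,
  "SPA", 1,
  "NOR", 50,
  "NOR", 10,
  "SWE", 42,
  "SWE", 1
)

# One row per country with its total sales, then the five largest totals.
# `total` and `top_countries` are hypothetical names chosen for this sketch.
top_countries <- trans_merch_df %>%
  count(COUNTRY, wt = SALE, name = "total") %>%
  slice_max(total, n = 5)

# Keep every transaction whose COUNTRY appears in top_countries.
# semi_join() only filters rows of the left table; it adds no columns.
top_sales <- trans_merch_df %>%
  semi_join(top_countries, by = "COUNTRY")

print(top_sales)
```

Keeping the per-country totals in their own table can be handy when they are needed again later, for example to label a plot, while the transactional rows keep their original order and columns.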