rdplyrsummarize

Percentage of total counts by group in R


I'm trying to create an output that calculates the percentage of counts, out of total counts (in a data frame), by factor level, but can't seem to figure out how to retain the grouping structure in the output.

I can get the total counts that I want to divide by...

df %>% summarise(sum(num))
# 15

...and the total by group...

df %>% group_by(species) %>% summarise(sum(num))
# A tibble: 3 × 2
#   species                  `sum(num)`
#   <chr>                         <int>
# 1 Farfantepenaeus duorarum          4
# 2 Farfantepenaeus notialis          0
# 3 Farfantepenaeus spp              11

But I can't get it to get it to look like this...

# ???
#   species                     Percent
#   <chr>                         <int>
# 1 Farfantepenaeus duorarum       4 / 15 = 0.267
# 2 Farfantepenaeus notialis       0 / 15 = 0.000
# 3 Farfantepenaeus spp           11 / 15 = 0.733

The closest I got was this, but because I used reframe() it returns the ungrouped data

df %>% group_by(species) %>% 
  summarise(factor_count=sum(num)) %>% 
  # ungroup() %>% 
  # Wanring: # Please use `reframe()` instead., When switching from `summarise()` 
  # to `reframe()`, remember that `reframe()` always returns an ungrouped data
  reframe(percent=factor_count/sum(df$num))

# A tibble: 3 × 1
  percent
    <dbl>
1   0.267
2   0    
3   0.733

Data:

> dput(df)
structure(list(species = c("Farfantepenaeus notialis", "Farfantepenaeus spp", 
"Farfantepenaeus notialis", "Farfantepenaeus notialis", "Farfantepenaeus duorarum", 
"Farfantepenaeus duorarum", "Farfantepenaeus notialis", "Farfantepenaeus spp", 
"Farfantepenaeus duorarum", "Farfantepenaeus spp", "Farfantepenaeus notialis", 
"Farfantepenaeus duorarum", "Farfantepenaeus spp", "Farfantepenaeus notialis", 
"Farfantepenaeus notialis", "Farfantepenaeus spp", "Farfantepenaeus duorarum", 
"Farfantepenaeus spp", "Farfantepenaeus spp", "Farfantepenaeus duorarum", 
"Farfantepenaeus duorarum", "Farfantepenaeus spp", "Farfantepenaeus spp", 
"Farfantepenaeus spp", "Farfantepenaeus notialis"), num = c(0L, 
0L, 0L, 0L, 1L, 0L, 0L, 2L, 0L, 3L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 3L, 0L, 2L, 4L, 0L)), row.names = c(159897L, 174698L, 
236857L, 190237L, 327321L, 272931L, 304567L, 75538L, 109206L, 
351373L, 280332L, 163966L, 282183L, 341197L, 316962L, 354703L, 
343971L, 95333L, 244258L, 254061L, 87561L, 186908L, 221318L, 
258688L, 97737L), class = "data.frame")

Solution

  • Two steps: summarize group-totals, then percent-calcs on everything combined.

    library(dplyr)
    df %>%
      summarize(Percent = sum(num), .by = species) %>%
      mutate(Percent = Percent / sum(Percent))
    #                    species   Percent
    # 1 Farfantepenaeus notialis 0.0000000
    # 2      Farfantepenaeus spp 0.7333333
    # 3 Farfantepenaeus duorarum 0.2666667
    

    For your code: