I'm trying to create an output that calculates the percentage of counts, out of total counts (in a data frame), by factor level, but can't seem to figure out how to retain the grouping structure in the output.
I can get the total counts that I want to divide by...
df %>% summarise(sum(num))
# 15
...and the total by group...
df %>% group_by(species) %>% summarise(sum(num))
# A tibble: 3 × 2
# species `sum(num)`
# <chr> <int>
# 1 Farfantepenaeus duorarum 4
# 2 Farfantepenaeus notialis 0
# 3 Farfantepenaeus spp 11
But I can't get it to get it to look like this...
# ???
# species Percent
# <chr> <int>
# 1 Farfantepenaeus duorarum 4 / 15 = 0.267
# 2 Farfantepenaeus notialis 0 / 15 = 0.000
# 3 Farfantepenaeus spp 11 / 15 = 0.733
The closest I got was this, but because I used reframe() it returns the ungrouped data
df %>% group_by(species) %>%
summarise(factor_count=sum(num)) %>%
# ungroup() %>%
# Wanring: # Please use `reframe()` instead., When switching from `summarise()`
# to `reframe()`, remember that `reframe()` always returns an ungrouped data
reframe(percent=factor_count/sum(df$num))
# A tibble: 3 × 1
percent
<dbl>
1 0.267
2 0
3 0.733
Data:
> dput(df)
structure(list(species = c("Farfantepenaeus notialis", "Farfantepenaeus spp",
"Farfantepenaeus notialis", "Farfantepenaeus notialis", "Farfantepenaeus duorarum",
"Farfantepenaeus duorarum", "Farfantepenaeus notialis", "Farfantepenaeus spp",
"Farfantepenaeus duorarum", "Farfantepenaeus spp", "Farfantepenaeus notialis",
"Farfantepenaeus duorarum", "Farfantepenaeus spp", "Farfantepenaeus notialis",
"Farfantepenaeus notialis", "Farfantepenaeus spp", "Farfantepenaeus duorarum",
"Farfantepenaeus spp", "Farfantepenaeus spp", "Farfantepenaeus duorarum",
"Farfantepenaeus duorarum", "Farfantepenaeus spp", "Farfantepenaeus spp",
"Farfantepenaeus spp", "Farfantepenaeus notialis"), num = c(0L,
0L, 0L, 0L, 1L, 0L, 0L, 2L, 0L, 3L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 3L, 0L, 2L, 4L, 0L)), row.names = c(159897L, 174698L,
236857L, 190237L, 327321L, 272931L, 304567L, 75538L, 109206L,
351373L, 280332L, 163966L, 282183L, 341197L, 316962L, 354703L,
343971L, 95333L, 244258L, 254061L, 87561L, 186908L, 221318L,
258688L, 97737L), class = "data.frame")
Two steps: summarize group-totals, then percent-calcs on everything combined.
library(dplyr)
df %>%
summarize(Percent = sum(num), .by = species) %>%
mutate(Percent = Percent / sum(Percent))
# species Percent
# 1 Farfantepenaeus notialis 0.0000000
# 2 Farfantepenaeus spp 0.7333333
# 3 Farfantepenaeus duorarum 0.2666667
For your code:
reframe
is not necessary (mostly when the number of rows changes, it can often be used in place of summarise
, but I haven't verified if/where the two differ significantly), and in fact here it will drop the species
columndf$
in a pipe that starts with df
: using df$num
ignores anything you've done since the start of the pipe, meaning that grouping, filtering, additions/changes, etc, are not available in that version of df
. There are certainly times when it is useful and even necessary, but they are few and far-between.