I have a question similar to Calculate proportion of positives values by group, but for grouping the average fraction by many columns. I'd like to get the proportion of non-zero values in "num" by "Year" and "season". Something that works for n# of columns, no matter where they are in the df in relation to each other.
My data:
> head(df)
# A tibble: 6 x 6
Year Month Day Station num season
<fct> <dbl> <dbl> <dbl> <dbl> <fct>
1 2017 1 3 266 4 DRY
2 2018 1 3 270 2 DRY
3 2018 1 3 301 1 DRY
4 2018 1 4 314 0 DRY
5 2018 2 4 402 0 DRY
6 2018 1 4 618 0 WET
I thought something like this would work, but I get a warning message:
> aggregate(df$num>0~df[,c(1,6)],FUN=mean) # Average proportion of num > 0 per year & season
Error in model.frame.default(formula = env_subset$den > 0 ~ env_subset[, :
invalid type (list) for variable 'env_subset[, c(1, 6)]'
With dplyr
, I think this is what you want:
library(dplyr)
df %>% group_by(Year, season) %>%
summarize(prop_gt_0 = mean(num > 0), .groups = "drop")
# # A tibble: 3 × 3
# Year season prop_gt_0
# <int> <chr> <dbl>
# 1 2017 DRY 1
# 2 2018 DRY 0.5
# 3 2018 WET 0
It's usually better to refer to columns by name rather than by number, so, as you say it works "no matter where they are in the df".
You can still use aggregate
--I prefer the formula interface for working with column names:
aggregate(num ~ Year + season, data = df, FUN = \(x) mean(x > 0))
# Year season num
# 1 2017 DRY 1.0
# 2 2018 DRY 0.5
# 3 2018 WET 0.0