I've been asked to generate something like this:
library("tidyverse")
df <- tibble(
comparison_var = c(1:10),
var_to_sum = runif(10, max = 10),
other_var = runif(10, max = 50)
)
summary_tbl <- df |>
summarize(
other_summary = mean(other_var),
sum_gt2 = sum((comparison_var > 2) * var_to_sum),
sum_gt3 = sum((comparison_var > 3) * var_to_sum),
#...,
sum_gtN = sum((comparison_var > 9) * var_to_sum)
)
Without repeating myself. In words, generate an arbitrary number of columns calculated with a (single) simple varying parameter value. Technically, I already know some ways to get this done:
(comparison_var > n)
, then across(starts_with(...), ...)
.
for
loop or apply
/map
call? I.e. can the looping be done, in pipe chain, inside mutate(...)
?summarize()
within a loop, for (cutoff in 1:N) {...}
, I guess _join()
ing the results onto the existing summary table.across()
, or some maybe some map
/apply
thing that returns a dataframe.Some of these are not horribly inelegant, but it really feels like this should be something super basic. It's the exact same idea as across()
, except across an external vector rather than a number of columns. In fact, every time I try to google for it, all I find are questions about how to use across
.
I would really like something that doesn't involve generating extra columns, or looping, or really, ideally, leaving a pipe chain.
I did notice that a value of summarize()
or mutate()
can be a tibble itself, generating more than one column at once. That led to writing this:
worldpop_educ_splits <- merged_tbl |>
summarize(
across(
starts_with("yr_sch_gt"),
~ sum(.x * pop, na.rm = TRUE) / sum(pop, na.rm = TRUE),
.names = "fracwt_{.col}"
),
{
tbl <- tibble(.rows = 1)
for (cutoff in 2:9) {
tbl[[1, glue("fracwt_yrsch_gt{cutoff}")]] <-
sum((yr_sch > cutoff) * pop, na.rm = TRUE) / sum(pop, na.rm = TRUE)
}
tbl
},
.by = year
)
Which technically works, but is abhorrent, and, as you can guess, insanely slow. I left the the first across()
call in the example as it is the same calculation that I want to do in the 2nd one, but for a variable which I already had an indicator for each cutoff level. I just would prefer not to make an indicator for this one too, and would like to stop doing so for the first as well.
Yes, I can just mutate up some columns, then drop them later or let them disappear with summarize()
. It just feels like there has to be a very obvious, elegant way to do this that I'm missing.
Thanks much, in advance, for your input!
I'd also go with the summrise-function-returning-a-dataframe route, but I think you are overthinking it a bit. This should do, works with groups as well:
library(tidyverse)
df |>
summarize(
other_summary = mean(other_var),
sum_gt = 2:9 |>
set_names() |>
map(\(gt) sum(var_to_sum[comparison_var > gt])) |>
bind_cols(),
) |>
unnest_wider(sum_gt, names_sep = "")
#> # A tibble: 1 × 9
#> other_summary sum_gt2 sum_gt3 sum_gt4 sum_gt5 sum_gt6 sum_gt7 sum_gt8 sum_gt9
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 29.5 45.1 42.2 33.9 27.5 22.3 15.0 13.6 7.05
Reference:
df |>
summarize(
other_summary = mean(other_var),
sum_gt2 = sum((comparison_var > 2) * var_to_sum),
sum_gt3 = sum((comparison_var > 3) * var_to_sum),
#...,
sum_gtN = sum((comparison_var > 9) * var_to_sum)
)
#> # A tibble: 1 × 4
#> other_summary sum_gt2 sum_gt3 sum_gtN
#> <dbl> <dbl> <dbl> <dbl>
#> 1 29.5 45.1 42.2 7.05
Example data:
set.seed(42)
df <- tibble(
comparison_var = c(1:10),
var_to_sum = runif(10, max = 10),
other_var = runif(10, max = 50)
)
df
#> # A tibble: 10 × 3
#> comparison_var var_to_sum other_var
#> <int> <dbl> <dbl>
#> 1 1 9.15 22.9
#> 2 2 9.37 36.0
#> 3 3 2.86 46.7
#> 4 4 8.30 12.8
#> 5 5 6.42 23.1
#> 6 6 5.19 47.0
#> 7 7 7.37 48.9
#> 8 8 1.35 5.87
#> 9 9 6.57 23.7
#> 10 10 7.05 28.0
bind_cols()
can be replaced with list()
to avoid some tibble overhead & gain some slight performance boost, but difference is surprisingly small, around 2ms on my machine when testing with 100000 rows.