I'm a long-time developer but somewhat new to the R language. I'm trying to write clean, maintainable code. I know how to do this in several other languages, but not in R.
I've got an R function that performs the same action for different fields.
# working code but not DRY
library(dplyr)
summarize_dataset_v2 <- function ( data, dataset ) {
  switch ( dataset,
    hp = {
      data %>%
        group_by ( cyl ) %>%
        summarize ( hp = mean ( hp ),
                    num = n ( ) ) ->
        summarized_composite
    },
    wt = {
      data %>%
        group_by ( cyl ) %>%
        summarize ( wt = mean ( wt ),
                    num = n ( ) ) ->
        summarized_composite
    },
    stop ( "BAD THINGS" ) )
  return ( summarized_composite )
}
The actual code has 6-8 variants with more logic. It works, but because it is not DRY it is a bug waiting to happen.
Conceptually what I want looks something like this:
# pseudocode: *field_name is not valid R
summarize_dataset_v2 <- function ( data, dataset ) {
  switch ( dataset,
    hp = { field_name = "hp" },
    wt = { field_name = "wt" },
    stop ( "BAD THINGS" ) )
  data %>%
    group_by ( cyl ) %>%
    summarize ( *field_name = mean( *field_name ),
                num = n( )
    ) ->
    summarized_composite
  return( summarized_composite )
}
The *field_name construct is just there to illustrate that I'd like to parameterize that common code. Maybe currying that summarize statement would work. I'm using the tidyverse stuff but I'm open to using another package to accomplish this.
Edit #1: Thanks for the answers (https://stackoverflow.com/users/12993861/stefan, https://stackoverflow.com/users/12256545/user12256545)
I've applied both answers to my example code and understand (I think) how they work. The one from stefan matches my experience in other languages. The one from user12256545 comes from a different POV and shifts focus to the caller, giving it more power. I haven't done a lot of formula-based code so this is a chance to explore that facet.
I'm going to apply both approaches to my actual problem to see how they feel. I'll respond with the results in a few days.
Thank you both.
Edit #2: When I applied these two approaches to my actual code I found that the one by stefan matched my mental model of how this would work. I accepted that as an answer.
Thanks!
One way to get rid of the duplicated code is shown below. First, switch is not necessary. Instead, you can use the .data pronoun to pass column names as strings. Additionally, I use glue syntax and the walrus operator := to name the "mean" column after the column name passed as an argument:
library(dplyr)

summarize_dataset_v2 <- function(data, dataset) {
  if (!dataset %in% c("hp", "wt")) stop("BAD THINGS")

  data %>%
    group_by(cyl) %>%
    summarize(
      "{dataset}" := mean(.data[[dataset]]),
      num = n()
    )
}

summarize_dataset_v2(mtcars, "hp")
#> # A tibble: 3 × 3
#>     cyl    hp   num
#>   <dbl> <dbl> <int>
#> 1     4  82.6    11
#> 2     6 122.      7
#> 3     8 209.     14

summarize_dataset_v2(mtcars, "wt")
#> # A tibble: 3 × 3
#>     cyl    wt   num
#>   <dbl> <dbl> <int>
#> 1     4  2.29    11
#> 2     6  3.12     7
#> 3     8  4.00    14

summarize_dataset_v2(mtcars, "disp")
#> Error in summarize_dataset_v2(mtcars, "disp"): BAD THINGS
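If the caller would rather pass a bare column name than a string, the same idea can probably be expressed with dplyr's embrace operator {{ }}. The following is only a sketch under that assumption: the names summarize_by_cyl and field are hypothetical and not part of the code above, and the explicit whitelist check is dropped, so an unknown column fails inside dplyr rather than with the custom "BAD THINGS" message.

```r
library(dplyr)

# Hypothetical variant: `field` is an unquoted column name, not a string.
summarize_by_cyl <- function(data, field) {
  data %>%
    group_by(cyl) %>%
    summarize(
      # "{{ field }}" names the result column after the column that was passed in
      "{{ field }}" := mean({{ field }}),
      num = n()
    )
}

res <- summarize_by_cyl(mtcars, hp)
names(res)  # "cyl" "hp" "num"
```

The string-based version with .data[[dataset]] is the better fit when the field name arrives as data (for example from a configuration value), while the embrace version reads more naturally for interactive use.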