I'd like to generate some simple calculations for each combination of two factor variables and store the results in a data frame. Here are the data:
df <- data.frame(SPECIES = as.factor(c(rep("SWAN",10), rep("DUCK",4), rep("GOOS",12),
rep("PASS",9), rep("FALC",10))),
DRAINAGE = as.factor(c(rep(c("Central", "Upper", "West"),15))),
CATCH_QTY = c(1,1,2,5,6,1,2,1,1,1,1,1,3,1,1,2,1,1,2,2,
TAGGED = c(rep("T",6),NA,"T","T",NA,rep("T",10),NA,NA,
RECAP = c(rep(NA,6),"RC",NA,NA,"RC",rep(NA,10),"RC","RC",
And here is the function:
myfunction <- function(dat, yr, spp, drain){
dat <- dat %>% filter(SPECIES == spp, DRAINAGE == drain)
estimatea <<-
dat %>%
summarise(NumCaught = sum(CATCH_QTY, na.rm = T),
NewTags = sum(!is.na(TAGGED)),
Recaps = sum(!is.na(RECAP)),
TotTags = sum(NewTags+Recaps))
dataTest1 <- cbind(yr, spp, drain, estimatea$NumCaught, estimatea$NewTags,
estimatea$Recaps, estimatea$TotTags)
}
I've mostly been experimenting with nested for loops, but I'm struggling to store the output in a data frame because the variables I'm iterating over are factors rather than numeric, so many of the existing answers on Stack Exchange aren't relevant. The answer for iterating over factors here does not show how to store the output.
Some examples of my attempts:
out <- list()
for (i in seq_along(levels(df$SPECIES))) {
for (j in seq_along(levels(df$DRAINAGE))) {
out[i,j] <- myfunction(df, "2023", i, j)
}
}
Error in out[i, j] <- myfunction(df, "2023", i, j) :
incorrect number of subscripts on matrix
for (i in seq_along(levels(df$SPECIES))) {
for (j in seq_along(levels(df$DRAINAGE))) {
out[i+1,j+1] <- myfunction(df, "2023", i, j)
}
}
Error in out[i, j] <- myfunction(df, "2023", i, j) :
incorrect number of subscripts on matrix
I've also considered some non-for loop options, e.g.,
combos <- expand.grid(df$SPECIES, df$DRAINAGE) %>% distinct() %>%
drop_na() %>% rename(spp = Var1, drain = Var2)
test <- myfunction(df, "2023", combos$spp, combos$drain) #generates incorrect results
sapply(combos$spp, function(x) mapply(myfunction,x,combos$drain))
apply(combos, 2, FUN = myfunction)
Error in UseMethod("filter") :
no applicable method for 'filter' applied to an object of class "character"
Ideally, the output dataframe would look something like this:
desired_out <- data.frame(yr = rep("2023",3),
spp = c("DUCK", "DUCK", "GOOS"),
drain = c("West", "Central", "Upper"),
V4 = c(1,3,4),
V5 = c(1,1,3),
v6 = c(0,0,1),
V7 = c(1,1,4))
To get your desired output, dplyr functions do what you need without resorting to loops or the *apply family:
df %>%
  group_by(SPECIES, DRAINAGE) %>%
  summarise(
    NumCaught = sum(CATCH_QTY, na.rm = TRUE),
    NewTags = sum(!is.na(TAGGED)),
    Recaps = sum(!is.na(RECAP)),
    TotTags = NewTags + Recaps
  )
#> SPECIES DRAINAGE NumCaught NewTags Recaps TotTags
#> 1 SWAN Central 9 2 2 4
#> 2 SWAN Upper 8 3 0 3
#> 3 SWAN West 4 3 0 3
#> 4 DUCK Upper 2 2 0 2
#> 5 DUCK West 1 1 0 1
#> 6 DUCK Central 3 1 0 1
#> 7 GOOS West 4 3 1 4
#> 8 GOOS Central 6 3 1 4
#> 9 GOOS Upper 5 4 0 4
#> 10 PASS West 3 3 0 3
#> 11 PASS Central 3 3 0 3
#> 12 PASS Upper 3 2 1 3
#> 13 FALC West 4 3 1 4
#> 14 FALC Central 3 2 1 3
#> 15 FALC Upper 3 3 0 3
If your real-life requirements are more complicated and you really need some kind of looping, I would recommend the *apply family of functions. In your example, nested lapply() calls get to the same result:
# the function can just return the summarised 1-row data frame,
# no need to update estimatea
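myfunction <- function(dat, yr, spp, drain){
  dat %>% filter(SPECIES == spp, DRAINAGE == drain) %>%
    summarise(NumCaught = sum(CATCH_QTY, na.rm = TRUE),
              NewTags = sum(!is.na(TAGGED)),
              Recaps = sum(!is.na(RECAP)),
              TotTags = sum(NewTags+Recaps),
              # assumption: the grouping columns are kept via `.by`, which matches
              # the SPECIES and DRAINAGE columns in the output (needs dplyr >= 1.1.0)
              .by = c(SPECIES, DRAINAGE))
}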
# nested lapply() to create lists of 1-row data frames
# (use levels(df$SPECIES) not seq_along() because we want
# the character strings, not the numeric index)
outputs <- lapply(levels(df$SPECIES),
                  function(x) {
                    lapply(levels(df$DRAINAGE),
                           function(y) myfunction(df, "2023", x, y))
                  })
# bind them together into 1 data frame
do.call(bind_rows, outputs)
#> SPECIES DRAINAGE NumCaught NewTags Recaps TotTags
#> 1 DUCK Central 3 1 0 1
#> 2 DUCK Upper 2 2 0 2
#> 3 DUCK West 1 1 0 1
#> 4 FALC Central 3 2 1 3
#> 5 FALC Upper 3 3 0 3
#> 6 FALC West 4 3 1 4
#> 7 GOOS Central 6 3 1 4
#> 8 GOOS Upper 5 4 0 4
#> 9 GOOS West 4 3 1 4
#> 10 PASS Central 3 3 0 3
#> 11 PASS Upper 3 2 1 3
#> 12 PASS West 3 3 0 3
#> 13 SWAN Central 9 2 2 4
#> 14 SWAN Upper 8 3 0 3
#> 15 SWAN West 4 3 0 3
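If you would rather keep the for loops from the question, the same idea carries over: loop over the level labels themselves and collect each result in a list with a single index, since a plain list has no [i, j] dimensions (which is what the "incorrect number of subscripts" error is complaining about). The sketch below is only an illustration in base R; `toy` and `summarise_one()` are hypothetical stand-ins for the question's `df` and `myfunction()`.

```r
# Hypothetical toy data standing in for the question's `df`
toy <- data.frame(
  SPECIES   = factor(c("DUCK", "DUCK", "GOOS", "GOOS", "GOOS")),
  DRAINAGE  = factor(c("Upper", "West", "Upper", "Upper", "West")),
  CATCH_QTY = c(1, 2, 3, 1, 5)
)

# Hypothetical helper: one summary row for one species/drainage pair
summarise_one <- function(dat, yr, spp, drain) {
  sub <- dat[dat$SPECIES == spp & dat$DRAINAGE == drain, ]
  data.frame(yr = yr, spp = spp, drain = drain,
             NumCaught = sum(sub$CATCH_QTY, na.rm = TRUE))
}

out <- list()  # one-dimensional container, so index with [[k]]
k <- 0
for (spp in levels(toy$SPECIES)) {      # iterate over the labels, not 1, 2, 3
  for (drain in levels(toy$DRAINAGE)) {
    k <- k + 1
    out[[k]] <- summarise_one(toy, "2023", spp, drain)
  }
}

# Stack the 1-row data frames into a single data frame
result <- do.call(rbind, out)
result
```

Each pass appends one 1-row data frame, so `result` ends up with one row per species/drainage combination.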
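Generally, caution is advised with the <<- operator: it assigns to a variable in an enclosing environment (usually the global environment) rather than in the function's own scope, so calling the function silently changes state elsewhere. *apply functions are normally cleaner to use in R.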
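To make that side effect concrete, here is a minimal sketch in base R. The names `total`, `add_leaky()` and `add_pure()` are hypothetical and not part of the question's code.

```r
# Hypothetical example of state leaking out of a function through `<<-`
total <- 0

add_leaky <- function(x) {
  # `<<-` looks for `total` outside the function and overwrites it there
  total <<- total + x
  invisible(total)
}

add_pure <- function(running, x) {
  # returns a new value and leaves the caller's variables alone
  running + x
}

add_leaky(5)
add_leaky(5)
total
#> [1] 10

running <- 0
running <- add_pure(running, 5)
running
#> [1] 5
```

With `add_pure()` the caller decides where the result is stored, which is why returning the summarised data frame from `myfunction()` is easier to reason about than updating `estimatea` from inside it.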
As @Parfait points out in the comments, by() is the more appropriate function to use here. It subsets the data frame and does the summarisation without the need for nested lapply() calls:
# the function to return the summarised 1-row data frame
# it doesn't need to filter the data frame as `by()` does that for us
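myfunction <- function(dat, yr){
  dat %>%
    summarise(NumCaught = sum(CATCH_QTY, na.rm = TRUE),
              NewTags = sum(!is.na(TAGGED)),
              Recaps = sum(!is.na(RECAP)),
              TotTags = sum(NewTags+Recaps),
              # assumption: the grouping columns are kept via `.by`, which matches
              # the SPECIES and DRAINAGE columns in the output (needs dplyr >= 1.1.0)
              .by = c(SPECIES, DRAINAGE))
}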
# by() to create lists of 1-row data frames
outputs <- by(df, df[, c("SPECIES", "DRAINAGE")], myfunction, yr = "2023")
# bind them together into 1 data frame
do.call(bind_rows, outputs)
#> SPECIES DRAINAGE NumCaught NewTags Recaps TotTags
#> 1 DUCK Central 3 1 0 1
#> 2 FALC Central 3 2 1 3
#> 3 GOOS Central 6 3 1 4
#> 4 PASS Central 3 3 0 3
#> 5 SWAN Central 9 2 2 4
#> 6 DUCK Upper 2 2 0 2
#> 7 FALC Upper 3 3 0 3
#> 8 GOOS Upper 5 4 0 4
#> 9 PASS Upper 3 2 1 3
#> 10 SWAN Upper 8 3 0 3
#> 11 DUCK West 1 1 0 1
#> 12 FALC West 4 3 1 4
#> 13 GOOS West 4 3 1 4
#> 14 PASS West 3 3 0 3
#> 15 SWAN West 4 3 0 3
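The expand.grid() attempt from the question can probably be made to work as well, provided the two columns are walked in parallel and passed as character labels. A rough sketch in base R, assuming a hypothetical toy data frame `toy` and helper `summarise_one()` in place of the real `df` and `myfunction()`:

```r
# Hypothetical toy data and helper, not the question's real objects
toy <- data.frame(
  SPECIES   = factor(c("DUCK", "DUCK", "GOOS", "GOOS", "GOOS")),
  DRAINAGE  = factor(c("Upper", "West", "Upper", "Upper", "West")),
  CATCH_QTY = c(1, 2, 3, 1, 5)
)

summarise_one <- function(dat, yr, spp, drain) {
  sub <- dat[dat$SPECIES == spp & dat$DRAINAGE == drain, ]
  data.frame(yr = yr, spp = spp, drain = drain,
             NumCaught = sum(sub$CATCH_QTY, na.rm = TRUE))
}

# Every species/drainage pair, kept as character so labels are passed on
combos <- expand.grid(spp = levels(toy$SPECIES),
                      drain = levels(toy$DRAINAGE),
                      stringsAsFactors = FALSE)

# Map() steps through both columns together, one pair per call
rows <- Map(function(s, d) summarise_one(toy, "2023", s, d),
            combos$spp, combos$drain)

result <- do.call(rbind, rows)
rownames(result) <- NULL  # drop the row names inherited from the list names
result
```

Unlike the nested loops, this visits the pairs in expand.grid() order, where the first column varies fastest.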