r dplyr tidyverse normalize mutate

How to normalise two categories of season with an uneven number of months using the function mutate() in the dplyr package


During fieldwork, we collected data for counts of dolphins at coral reefs per month, per year. I have split my data into seasons for winter and summer.

This is my approach using dplyr:

Step 1: Calculate the total sightings and average group size per reef and season

Step 2: Normalise the total sightings by season length, because the seasons have an uneven number of months: winter is 7 months and summer is 5 months.

We should obtain these two new columns

However, when I run my code, I get the error message below. I can't share the original dataframe due to ownership issues, so I have included a dummy dataframe after the code.

Many thanks if you can help.

n length
`summarise()` has grouped output by 'Reef_Code'. You can override using the `.groups` argument.
Error in `mutate()`:
ℹ In argument: `Normalized_Sightings = Total_Sightings/season_months[Season]`.
ℹ In group 1: `Reef_Code = 1`.
Caused by error in `Total_Sightings / season_months[Season]`:
! non-numeric argument to binary operator
Run `rlang::last_trace()` to see where the error occurred.

R-code:

library(dplyr)

#Normalization for season, simple normalization based on length, 7 months for Winter and 5 for Summer

# Define months per season
season_months <- list("Winter" = 7, "Summer" = 5)

# Group by reef and season, calculate total sightings and average group size,
# then normalize the sightings by season length
result <- MyDf %>%
  group_by(Reef_Code, Season) %>%
  summarize(
    Total_Sightings = n(),                            # Count of sightings per reef and season
    Avg_Group_Size = mean(Group_Size, na.rm = TRUE)   # Average group size
  ) %>%
  mutate(Normalized_Sightings = Total_Sightings / season_months[Season])  # Normalize by season length

Dummy Dataframe

structure(list(Reef_Code = c(1L, 2L, 3L, 1L, 1L, 3L, 2L, 4L, 
2L, 5L, 4L, 2L, 3L, 6L, 5L, 3L, 6L, 6L, 4L, 2L, 5L, 4L, 1L, 2L, 
3L, 4L, 6L, 1L, 1L, 2L, 3L, 6L, 5L, 3L, 6L, 6L, 4L, 2L, 5L, 4L, 
3L, 1L, 1L, 3L, 2L, 4L, 2L, 5L, 4L, 2L, 3L, 6L, 5L, 3L, 5L, 4L, 
2L, 3L, 6L), Season = c("Summer", "Summer", "Summer", "Summer", 
"Summer", "Summer", "Summer", "Summer", "Winter", "Winter", "Winter", 
"Winter", "Winter", "Winter", "Winter", "Winter", "Winter", "Winter", 
"Summer", "Summer", "Summer", "Summer", "Summer", "Summer", "Winter", 
"Winter", "Winter", "Winter", "Winter", "Winter", "Winter", "Winter", 
"Winter", "Summer", "Summer", "Summer", "Summer", "Summer", "Summer", 
"Winter", "Summer", "Summer", "Summer", "Summer", "Summer", "Winter", 
"Winter", "Winter", "Winter", "Winter", "Winter", "Winter", "Summer", 
"Summer", "Summer", "Summer", "Summer", "Summer", "Winter"), 
    Group_Size = c(7L, 11L, 1L, 14L, 16L, 2L, 5L, 5L, 5L, 8L, 
    8L, 6L, 6L, 1L, 8L, 8L, 4L, 5L, 1L, 5L, 5L, 14L, 8L, 7L, 
    7L, 18L, 25L, 2L, 5L, 5L, 8L, 8L, 6L, 6L, 1L, 8L, 8L, 5L, 
    14L, 8L, 7L, 7L, 18L, 25L, 2L, 5L, 5L, 8L, 8L, 6L, 6L, 1L, 
    8L, 7L, 8L, 8L, 6L, 6L, 1L)), class = "data.frame", row.names = c(NA, 
-59L))

Solution

  • I'd suggest using these two lines in place of the last mutate line:

      ...
      left_join(data.frame(Season = c("Winter", "Summer"),
                           season_months = c(7, 5)),
                by = "Season") |>
      mutate(Normalized_Sightings = Total_Sightings / season_months)
    

    or

      mutate(Normalized_Sightings = Total_Sightings / if_else(Season == "Winter", 7, 5))
    

    or

      mutate(season_months = case_match(Season,
                                        "Winter" ~ 7,
                                        "Summer" ~ 5)) |>
      mutate(Normalized_Sightings = Total_Sightings / season_months)
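
For context on why the original line fails, here is a minimal base R sketch. It assumes the cause is that season_months was created with list(): indexing a list with single brackets returns another list, and a list cannot be divided. The names season_list, season_vec and seasons are illustrative only and are not part of the original code.

```r
# Illustrative season labels (hypothetical values, not the real data)
seasons <- c("Summer", "Winter", "Winter")

# A list indexed with `[` returns a list, which arithmetic rejects
season_list <- list(Winter = 7, Summer = 5)
lookup_is_list <- is.list(season_list[seasons])
list_result <- tryCatch(
  10 / season_list[seasons],
  error = function(e) conditionMessage(e)  # capture the error text instead of stopping
)

# A named numeric vector supports the same lookup and stays numeric
season_vec <- c(Winter = 7, Summer = 5)
vec_result <- 10 / unname(season_vec[seasons])
```

Under that assumption, defining season_months with c() instead of list() would be a fourth option, alongside the three above.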
    

    Note also that summarize() by default only drops the most recent grouping level, so the output is still grouped by Reef_Code. This could lead to unexpected results later if you expect calculations to be done in the context of the whole ungrouped data.

    To remove that grouping, you could add |> ungroup() after the summarize(), or pass .groups = "drop" as an argument to summarize(). Or, my preference, skip the group_by() and instead pass .by = c(Reef_Code, Season) as an argument to summarize(). This applies the grouping to the summarize() step alone, saving you the need to keep track of it.
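
Putting the pieces together, the following is a rough sketch of the .by variant combined with the case_match() option. It assumes dplyr 1.1.0 or later (needed for .by and case_match()), and it uses a small made-up data frame in place of MyDf purely for illustration.

```r
library(dplyr)

# Hypothetical miniature stand-in for MyDf (values invented for illustration)
MyDf <- data.frame(
  Reef_Code  = c(1L, 1L, 1L, 2L, 2L),
  Season     = c("Summer", "Summer", "Winter", "Winter", "Winter"),
  Group_Size = c(7L, 14L, 2L, 5L, 6L)
)

result <- MyDf |>
  summarize(
    Total_Sightings = n(),                             # sightings per reef and season
    Avg_Group_Size  = mean(Group_Size, na.rm = TRUE),  # average group size
    .by = c(Reef_Code, Season)                         # grouping applies to this step only
  ) |>
  mutate(
    season_months = case_match(Season, "Winter" ~ 7, "Summer" ~ 5),
    Normalized_Sightings = Total_Sightings / season_months  # sightings per month
  )
```

Because the grouping is supplied through .by, result comes back ungrouped, so no ungroup() call is needed afterwards.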