
Averaging two rows without a grouping variable in R


I'm new to R and am trying to average the values of each pair of consecutive rows into a single row. Every inner row is needed for two averages (see the example), which makes grouping impossible. The averages should be computed for several columns (var1 and var2 in the example). Here is an example of the dataset; the original contains several more variables and many more months.

df <- read.table(text = "Month, var1, var2 
Apr, 5, 3 
Jun, 7, 2 
Aug, 3, 6 
Oct, 10, 4", header = TRUE, sep = ",")

What I want to achieve:

output <- read.table(text = "Duration, var1, var2
Apr-Jun, 6, 2.5
Jun-Aug, 5, 4
Aug-Oct, 6.5, 5 ", header = TRUE, sep = ",")

I have tried the following:

library(dplyr)

output <- df %>%
  mutate(Duration = Month) %>%
  mutate(Duration = case_when(Duration == "Apr" ~ "",
                              Duration == "Jun" ~ "Apr-Jun",
                              Duration == "Aug" ~ "Jun-Aug",
                              Duration == "Oct" ~ "Aug-Oct",
                              TRUE ~ Duration)) %>%
  mutate(var1_mean = ifelse(Duration == "Apr-Jun", mean(var1[Month %in% c("Apr", "Jun")]), NA)) %>%
  mutate(var2_mean = ifelse(Duration == "Apr-Jun", mean(var2[Month %in% c("Apr", "Jun")]), NA)) %>%
  mutate(var1_mean = ifelse(Duration == "Jun-Aug", mean(var1[Month %in% c("Jun", "Aug")]), NA)) %>%
  mutate(var2_mean = ifelse(Duration == "Jun-Aug", mean(var2[Month %in% c("Jun", "Aug")]), NA)) %>%
  mutate(var1_mean = ifelse(Duration == "Aug-Oct", mean(var1[Month %in% c("Aug", "Oct")]), NA)) %>%
  mutate(var2_mean = ifelse(Duration == "Aug-Oct", mean(var2[Month %in% c("Aug", "Oct")]), NA))

The problem is that calculating var1_mean for Jun-Aug overwrites the values already calculated for Apr-Jun. I also tried writing the averages directly into var1 and var2, without creating the var1_mean and var2_mean columns, but then the Jun-Aug average is computed from the already-averaged var1 and var2 values.

Ideally, I think I need code that calculates the mean for each duration and loops through all the variables, but I'm stuck at the overwriting part. Any help is very welcome.


Solution

  • First, fixing the sample data so that it has exactly three columns (the two numeric columns are named sample and var1 here):

    df <- read.table(text = "Month, sample, var1
    Apr, 5, 3 
    Jun, 7, 2 
    Aug, 3, 6 
    Oct, 10, 4", header = TRUE, sep = ",")
    

    We can do this:

    library(dplyr)
    df %>%
      summarize(
        across(where(is.character), ~ paste(.[-n()], .[-1], sep = "-")), 
        across(where(is.numeric), ~ (.[-n()] + .[-1]) / 2)
      )
    #     Month sample var1
    # 1 Apr-Jun    6.0  2.5
    # 2 Jun-Aug    5.0  4.0
    # 3 Aug-Oct    6.5  5.0
    

    If you would rather specify the columns by name instead of by class, we can modify that to be

    df %>%
      summarize(
        across(c(Month), ~ paste(.[-n()], .[-1], sep = "-")), 
        across(c(sample, var1), ~ (.[-n()] + .[-1]) / 2)
      )
    

    which means that if other numeric columns need a different aggregation, each set of columns can get its own across() call with its own function. Note that all of these convert n rows to n-1 rows, so every column you want to preserve has to be reduced to n-1 values in some way. (The exception is grouping variables, which can be handled separately by using .by= or similar methods.)
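
The remark about grouping variables can be sketched as follows. This is a minimal illustration under stated assumptions, not part of the original answer: the `site` column, the `df2` data, and the helpers `pair_mean()` and `pair_label()` are hypothetical names introduced here. It also assumes dplyr 1.1.0 or later, where `reframe()` is the intended verb for results with more than one row per group (multi-row `summarize()` results are deprecated from that version on) and where `.by =` is available.

```r
library(dplyr)

# Hypothetical data: the answer's columns plus an assumed grouping column `site`
df2 <- data.frame(
  site   = rep(c("A", "B"), each = 3),
  Month  = rep(c("Apr", "Jun", "Aug"), times = 2),
  sample = c(5, 7, 3, 10, 20, 30),
  var1   = c(3, 2, 6, 1, 3, 5)
)

# x[-length(x)] drops the last element and x[-1] drops the first,
# so the two vectors line up every value with its successor.
pair_mean  <- function(x) (x[-length(x)] + x[-1]) / 2
pair_label <- function(x) paste(x[-length(x)], x[-1], sep = "-")

# reframe() permits any number of rows per group;
# .by = site keeps the pairing from crossing site boundaries.
res <- df2 %>%
  reframe(
    across(Month, pair_label),
    across(c(sample, var1), pair_mean),
    .by = site
  )

res
```

Each site has three months here, so each contributes two averaged rows, and the last row of site A is never paired with the first row of site B.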