I have a dataset in R that contains methane measurements taken on different dates by of different cows. The dataset is structured as follows:
Each date can have multiple measurements, and some dates might not have any measurements at all, so they do not appear in the dataset. I need to calculate a 3-day centered rolling average of the methane measurements (ch4) for each animal (id), considering all available measurements within the date range.
Key Requirements: Multiple measurements per date: The dataset includes multiple measurements for some dates, but can have a completely different number for another date depending on how many times the animal went to the feed bin, and some days might have no measurements. If the date has no measurement it doesn’t actually appear as a row in the data, but I can easily add in those rows if necessary.
3-day centered rolling average: The rolling average should be calculated over a 3-day window centered on each date. For example, the rolling average for January 2nd should consider all measurements from January 1st, 2nd, and 3rd.
Group by animal: The rolling average calculation should be done separately for each animal (ID), ensuring that the calculations for one animal does not affect another.
Example Dataset
Here's a small sample of what the dataset might look like for one techid:
Meausre_date = c(2023-01-01, 2023-01-01, 2023-01-01, 2023-01-01, 2023-01-02, 2023-01-03 2023-01-03, 2023-01-05)
Ch4 = c(200, 250, 233, 256, 270, 256, 290, 299)
Desired output
For each techid, I want to generate a new column that includes:
To summarize, I need an efficient way to compute a 3-day centered rolling average of methane measurements for each animal, considering varying numbers of measurements per date and including dates with no measurements in the sequence.
All of the methods I have seen and attempted so far haven’t been able to deal with the different number of measures on each measure_date. (it is a very large dataset) example of dataset
I presume I am missing something simple or just attacking this from the wrong angle. (changed some variable names in the code and dataset which is why there are differences)
Using the data shown reproducibly in the Note at the end, define a function, Mean
, which makes centered means and then for each tech_id
complete the dates using complete
to insert missing dates and then run Mean
on that.
library (dplyr)
library(tidyr)
# x is times and y the values; returns rolling mean of 3 centered
Mean <- function(x, y) {
x <- as.numeric(x)
sapply(x, function(z) mean(y[x %in% seq(z-1, z+1)], na.rm = TRUE))
}
DF %>%
mutate(Measure_date = as.Date(Measure_date)) %>%
group_by(tech_id) %>%
complete(Measure_date = seq(min(Measure_date), max(Measure_date), by="day")) %>%
mutate(Mean = Mean(Measure_date, Ch4) ) %>%
ungroup
giving
# A tibble: 9 × 4
tech_id Measure_date Ch4 Mean
<dbl> <date> <dbl> <dbl>
1 1 2023-01-01 200 242.
2 1 2023-01-01 250 242.
3 1 2023-01-01 233 242.
4 1 2023-01-01 256 242.
5 1 2023-01-02 270 251.
6 1 2023-01-03 256 272
7 1 2023-01-03 290 272
8 1 2023-01-04 NA 282.
9 1 2023-01-05 299 299
DF <- data.frame(tech_id = 1,
Measure_date = c("2023-01-01", "2023-01-01", "2023-01-01", "2023-01-01",
"2023-01-02", "2023-01-03", "2023-01-03", "2023-01-05"),
Ch4 = c(200, 250, 233, 256, 270, 256, 290, 299))